Chapter 1
Introduction 

1.1  Markov  Decision  Processes 

Markov Decision Processes (MDPs) are widely used for modeling and describing
sequential decision making under uncertainty that arises in various areas such as manufacturing
systems, financial engineering, artificial intelligence, and operations research. An
MDP model consists of four principal components: a state space, an action space, the effects
of the actions, and the immediate cost incurred by the actions. The relations among
these components are illustrated as follows.

Consider a decision maker who interacts simultaneously with his environment over a
finite or infinite time horizon divided into a sequence of stages (decision epochs). At each
stage, the decision maker observes the state of the environment, where it is assumed that
the observation is complete and perfect; based on his observation, a decision (an action)
is made to react to the environment. The decision influences (either deterministically or
stochastically) the state at the next stage, and depending on the state and the decision
made, a certain cost is incurred. The expected total cost accumulated from the current
stage to the end of the planning horizon is called a value function. The goal of the decision
maker is to find a decision rule/policy specifying the best action to take for each of the
states, so that he can act optimally in the changing environment, in the sense that the
expected total (discounted) cost over the entire planning horizon is minimized.

In finite horizon problems, the optimal decision rules (policies) generally depend on
both the stage and state; they can be computed by the classical dynamic programming
(DP) algorithm starting from the terminal stage. In DP, the optimal decisions are determined
backwards step by step as the minimizers of a functional equation, which expresses
the value function at the present stage as the sum of the one-stage current cost and the
value function at the following stage. This way of determining the optimal policy is based
on Bellman’s principle of optimality, which says, “An optimal policy has the property that
whatever the initial state and initial decision are, the remaining decisions must constitute
an optimal policy with regard to the state resulting from the first decision” (cf. [63]).
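The backward recursion described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions that are not part of the text: the transition probabilities and one-stage costs are taken to be stationary, the state and action sets are small and finite, and the two-state example data (`P`, `R`, the zero terminal cost) are hypothetical.

```python
def backward_induction(states, actions, P, R, terminal_cost, T, alpha=1.0):
    """Finite-horizon DP sketch.

    P[x][a][y] : probability of moving from x to y under action a (assumed stationary)
    R[x][a]    : one-stage cost of action a in state x (assumed stationary)
    Returns (J, policy): J[t][x] is the optimal cost-to-go from stage t,
    policy[t][x] is a minimizing action at stage t.
    """
    J = [None] * (T + 1)
    policy = [None] * T
    J[T] = {x: terminal_cost[x] for x in states}
    for t in range(T - 1, -1, -1):  # proceed backward from the terminal stage
        J[t], policy[t] = {}, {}
        for x in states:
            # one-stage cost plus (discounted) expected value at the next stage
            q = {a: R[x][a] + alpha * sum(P[x][a][y] * J[t + 1][y] for y in states)
                 for a in actions}
            best = min(q, key=q.get)
            J[t][x], policy[t][x] = q[best], best
    return J, policy


# Hypothetical two-state, two-action example.
states, actions = [0, 1], [0, 1]
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 1.0, 1: 2.0},
     1: {0: 4.0, 1: 3.0}}
J, policy = backward_induction(states, actions, P, R, {0: 0.0, 1: 0.0}, T=3)
print(policy[0], J[0])
```

Note that the resulting decision rule `policy[t]` depends on both the stage `t` and the state, as stated above.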

A variety of methods exist for solving infinite horizon MDPs, many
of which can be viewed as different strategies for solving Bellman’s equation. The two
best-known approaches are value iteration (VI) and policy iteration (PI). Value
iteration is essentially the extension of the DP algorithm to the infinite horizon case; it
starts with an arbitrary (bounded) function and updates at each iteration the current
function into a new function that better approximates the optimal value function. Thus
the algorithm essentially amounts to using the solution to a finite but large horizon problem
to approximate the solution to the infinite horizon problem. As an alternative to VI, policy
iteration starts with an arbitrarily chosen stationary policy and generates a sequence of
new policies. At each iteration of PI, a policy evaluation step is carried out to compute
the value function associated with the current policy as the solution of a system of linear
equations. Once this value function is obtained, a policy improvement step is used to
generate a new policy that improves the performance (in terms of value function) of the
current one. The process is repeated until no further improvement can be achieved.
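To make the two procedures concrete, the following sketch runs VI and PI on a hypothetical two-state, two-action discounted cost MDP. The data (`P`, `R`, `ALPHA`), the stopping tolerance, and the use of Cramer's rule for the 2x2 policy evaluation system are illustrative assumptions; a realistic implementation would use a general linear solver.

```python
STATES = [0, 1]
ACTIONS = [0, 1]
ALPHA = 0.9  # discount factor (hypothetical)
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},   # P[x][a][y]
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 1.0, 1: 2.0},                 # R[x][a]
     1: {0: 4.0, 1: 3.0}}


def q_value(J, x, a):
    """One-stage cost plus discounted expected cost-to-go."""
    return R[x][a] + ALPHA * sum(P[x][a][y] * J[y] for y in STATES)


def value_iteration(tol=1e-10):
    J = {x: 0.0 for x in STATES}  # arbitrary bounded starting function
    while True:
        J_new = {x: min(q_value(J, x, a) for a in ACTIONS) for x in STATES}
        if max(abs(J_new[x] - J[x]) for x in STATES) < tol:
            return J_new
        J = J_new


def evaluate_policy(policy):
    """Solve (I - ALPHA * P_pi) J = R_pi for two states by Cramer's rule."""
    p = [[P[x][policy[x]][y] for y in STATES] for x in STATES]
    r = [R[x][policy[x]] for x in STATES]
    a11, a12 = 1.0 - ALPHA * p[0][0], -ALPHA * p[0][1]
    a21, a22 = -ALPHA * p[1][0], 1.0 - ALPHA * p[1][1]
    det = a11 * a22 - a12 * a21
    return {0: (r[0] * a22 - a12 * r[1]) / det,
            1: (a11 * r[1] - a21 * r[0]) / det}


def policy_iteration():
    policy = {x: ACTIONS[-1] for x in STATES}  # arbitrary initial stationary policy
    while True:
        J = evaluate_policy(policy)                                   # policy evaluation
        improved = {x: min(ACTIONS, key=lambda a: q_value(J, x, a))   # policy improvement
                    for x in STATES}
        if improved == policy:  # no further improvement possible
            return policy, J
        policy = improved


J_vi = value_iteration()
pi_policy, J_pi = policy_iteration()
print(pi_policy, J_pi, J_vi)
```

On this small example the two methods agree up to the VI tolerance; PI terminates after a handful of policies, whereas VI needs a few hundred sweeps to reach the same accuracy.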

There are also various straightforward enhancements of VI and PI for solving MDPs,
including methods that reduce the computational cost of VI and PI by directly applying
standard iterative schemes for solving systems of linear equations, such as the
Gauss-Seidel method (cf. e.g., [13] and [63]) and the successive over-relaxation (SOR)
method ([81]). Puterman and Shin [62] proposed a modified policy iteration algorithm,
which takes the basic form of PI, with the difference being that the policy evaluation step
is carried out only approximately by executing a limited number of value iteration steps.
The algorithm combines the advantages of VI and PI, and thus, to some extent, alleviates
the high computational burden of directly solving (e.g., by Gaussian elimination) the systems
of linear equations (i.e., Bellman’s equation) that arise in policy evaluation.
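A minimal sketch of the modified policy iteration idea is given below, again on a hypothetical two-state MDP. The number of evaluation sweeps `m`, the stopping rule, and the choice of an upper bound on the optimal cost as the starting function are assumptions made for illustration and are not taken from [62].

```python
STATES = [0, 1]
ACTIONS = [0, 1]
ALPHA = 0.9  # discount factor (hypothetical)
P = {0: {0: [0.9, 0.1], 1: [0.2, 0.8]},   # P[x][a][y]
     1: {0: [0.5, 0.5], 1: [0.1, 0.9]}}
R = {0: {0: 1.0, 1: 2.0},                 # R[x][a]
     1: {0: 4.0, 1: 3.0}}


def q_value(J, x, a):
    return R[x][a] + ALPHA * sum(P[x][a][y] * J[y] for y in STATES)


def modified_policy_iteration(m=5, tol=1e-10):
    # Start from an upper bound on the optimal cost (an assumption of this sketch).
    bound = max(R[x][a] for x in STATES for a in ACTIONS) / (1.0 - ALPHA)
    J = {x: bound for x in STATES}
    while True:
        # Policy improvement: greedy policy with respect to the current J.
        policy = {x: min(ACTIONS, key=lambda a: q_value(J, x, a)) for x in STATES}
        J_next = {x: q_value(J, x, policy[x]) for x in STATES}
        gap = max(abs(J_next[x] - J[x]) for x in STATES)
        J = J_next
        if gap < tol:
            return policy, J
        # Approximate policy evaluation: m - 1 further sweeps under the fixed
        # policy, instead of solving the linear system exactly.
        for _ in range(m - 1):
            J = {x: q_value(J, x, policy[x]) for x in STATES}


mpi_policy, J_mpi = modified_policy_iteration()
print(mpi_policy, J_mpi)
```

With `m = 1` the scheme reduces to value iteration, and as `m` grows it behaves increasingly like policy iteration.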

For the sake of completeness, it is worth mentioning that the linear programming
(LP) approach has also long been established as a useful method for solving infinite horizon
discounted cost MDPs (cf. [13], [63]). The basic idea of the LP approach is to formulate
Bellman’s equation as a set of linear constraints over all state-action pairs and interpret
the optimal value function as the “largest” (in a minimization context) value function that
satisfies these constraints.
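For concreteness, one standard way of writing this linear program is shown below. It assumes finite state and action spaces and stationary data; the symbols $R(x, a)$, $P_{x,y}(a)$ and $\alpha$ denote the one-stage cost, the transition probability and the discount factor, and the unit weights in the objective are one common choice (any strictly positive weights would do).

```latex
\begin{align*}
\max_{J} \quad & \sum_{x \in X} J(x) \\
\text{s.t.} \quad & J(x) \le R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a)\, J(y),
  \qquad \forall\, x \in X,\ a \in A .
\end{align*}
```

There is one constraint per state-action pair, and the optimal value function is the componentwise largest $J$ satisfying all of them.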

The aforementioned approaches may quickly lead to computational intractability,
since they require enumerating the entire state and action spaces, which often grow exponentially
fast with the parameters of the problem (i.e., the well-known “curse of dimensionality”).
In order to address this issue, many researchers have used various approximation
schemes to reduce the size of the state/action spaces.

1.1.1  State  Space  Reduction  Techniques 

Bertsekas and Castañon [15] proposed a class of adaptive aggregation algorithms for
solving infinite horizon MDPs. The idea is to group the states of the original problem into
a smaller number of aggregate states in such a way that the resulting aggregate states
constitute a smaller MDP. If the size of the resultant problem is small enough,
then its value function can be computed exactly by directly solving the system of linear
equations. This value function is in turn used to approximate the value function of the
original problem by means of a deaggregation scheme.

Unlike the state aggregation approach, some other approaches have concentrated on
approximating the value function via a suitable parameterization, in effect restricting the
search to a lower-dimensional parameter space instead of the entire state space. The
approximation  is  carried  out  via  a  number  of  different  techniques:  Bellman  et  al.  [11] 
explored  the  use  of  polynomial  approximations  as  compact  representations  of  the  value 
function  in  order  to  accelerate  dynamic  programming.  Schweitzer  and  Seidmann  [73] 
developed  several  techniques  for  approximating  value  functions  using  linear  combinations 
of  fixed  sets  of  basis  functions.  More  recently,  Tsitsiklis  and  Van  Roy  [83]  developed 
algorithms  that  employ  the  feature-based  compact  representations  of  the  value  function  in 
dynamic  programming.  One  of  their  algorithms  was  successfully  applied  to  play  the  Tetris 
game.  Trick  and  Zin  [82]  studied  approaches  based  on  linear  programming  for  solving  large 
MDPs  and  considered  the  use  of  low-dimensional  cubic-spline  approximations  to  the  value 
function.  In  De  Farias  and  Van  Roy  [27],  the  value  function  was  approximated  by  a  linear 
combination  of  pre-selected  basis  functions.  The  approach  was  used  in  conjunction  with 
linear  programming  for  approximately  solving  infinite  horizon  discounted  cost  problems. 

Another class of methods explores the use of Monte Carlo integration to avoid
the high computational cost of multivariate numerical integration that appears in the
value iteration approach. The most notable work in this area is due to Rust [72], who
used a randomized version of the Bellman operator to solve a class of MDPs with finite
action spaces called discrete decision processes (DDPs). Rust showed that (under some
regularity conditions) the amount of computational time required for his algorithms to
solve the DDP problem increases only polynomially rather than exponentially with the
dimension of the state variables.

All the computational methods mentioned so far require an explicit, complete mathematical
model of the system to be controlled, represented by the availability of the cost
structure and the transition probabilities. There is, on the other hand, a class of methods that
does not require the explicit specification of the transition probabilities and one-stage
costs. Instead, these methods rely on Monte Carlo simulation, under the assumption that the underlying
system can be simulated. In the artificial intelligence community, these approaches
are often referred to as reinforcement learning, which includes the method of temporal
differences ([80]) and Q-learning ([85]), as well as certain variations and extensions of them
(cf. e.g., [13] for a review). Recently, there have been some new and exciting ideas that
combine  the  use  of  the  specialized  MDP  techniques  with  the  solution  strategies  in  the  area 
of  global  optimization.  In  these  approaches,  the  simulation  techniques  are  used  not  only 
to  resolve  the  issue  of  the  unavailability  of  the  explicit  parameters  of  MDP  models,  but 
also  to  avoid  searching  (enumerating)  the  entire  (large  or  uncountable)  state  or  solution 
space.  Chang  et  al.  [20]  proposed  an  algorithm  based  on  the  idea  of  simulated  annealing 
([50])  for  solving  finite  horizon  MDPs.  The  algorithm  works  directly  on  the  policy  space 
and  iteratively  updates  a  probability  distribution  over  a  given  set  of  policies.  They  showed 
that  the  sequence  of  distributions  will  converge  to  a  distribution  concentrated  only  on  the 
optimal  policies.  A  similar  but  more  general  framework  was  also  proposed  in  [58],  where 
MDPs  with  several  reward  (cost)  criteria  are  formulated  as  global  optimization  problems 
over  the  set  of  all  admissible  policies,  and  are  thus  solved  by  using  the  cross-entropy  (CE) 
method  ([26],  [66],  [67],  [68]).  The  efficiency  of  their  approach  is  demonstrated  for  an 
inventory control problem and a maze problem.

1.1.2 Action Space Reduction Techniques


In  contrast  to  large  state  spaces,  the  issue  of  large  action  spaces  has  been  much 
less  explored.  It  was  partially  addressed  in  early  work  by  MacQueen  [57],  who  used 
some  inequality  forms  of  Bellman’s  equation  together  with  bounds  on  the  optimal  value 
function  to  identify  and  eliminate  non-optimal  actions  in  order  to  reduce  the  size  of  the 
action  sets  to  be  searched  at  each  iteration  of  the  algorithm.  Since  then,  the  procedure 
has  been  applied  to  several  standard  methods  like  policy  iteration  (PI),  value  iteration 
(VI)  and  modified  policy  iteration  (cf.  e.g.,  [63]  for  a  review).  In  a  recent  paper  [30],  the 
action  elimination  idea  has  been  explored  in  a  reinforcement  learning  context  where  the 
explicit  MDP  model  is  not  known.  So  far,  all  of  these  algorithms  generally  require  that 
the  admissible  set  of  actions  at  each  state  is  finite. 

1.2  Global  Optimization 

The  goal  of  global  optimization  is  to  find  parameter  values  that  achieve  the  optimum 
of  an  objective  function.  In  general,  due  to  the  presence  of  multiple  local  optimal  solutions, 
global  optimization  problems  are  typically  extremely  difficult  to  solve  exactly.  This  section 
briefly  reviews  some  of  the  standard  global  optimization  algorithms  with  an  emphasis 
on  general  solution  techniques  that  are  applicable  to  both  combinatorial  and  continuous 
optimization  problems. 

Methods  for  global  optimization  can  be  categorized  based  on  a  number  of  different 
criteria.  For  instance,  they  can  be  classified  either  based  on  the  properties  of  problems 
to  be  solved  (combinatorial  or  continuous,  nonlinear,  linear,  convex,  etc.)  or  by  the  types 
of  guarantees  that  the  methods  provide  for  the  final  solution.  The  classification  that  best 
fits our proposed research is from the algorithmic point of view, where solution algorithms
are categorized as being either instance-based or model-based; cf. [91].

1.2.1  Instance-based  Methods 

In instance-based methods, the searches for new candidate solutions depend explicitly
on previously generated solutions. Some well-known approaches are simulated annealing
(SA) ([50]), genetic algorithms (GAs) ([79]), tabu search ([35]), and the recently
proposed nested partitions (NP) method ([75], [76]).

Simulated  annealing  was  initially  introduced  to  solve  combinatorial  optimization 
problems.  The  algorithm  starts  out  with  some  initial  configuration/solution,  and  the 
neighbors  (candidate  solutions)  of  the  current  solution  are  randomly  visited.  The  key 
idea  of  the  algorithm  is  that  neighbors  that  are  either  better  or  worse  than  the  current 
solution  may  both  be  accepted  with  a  certain  probability,  and  the  probability  of  accepting 
worse  solutions  gradually  decreases  during  the  search  process.  Thus  the  technique  gives  a 
simple  local  search  algorithm  the  possibility  to  escape  from  local  optimal  solutions.  The 
algorithm  was  later  extended  to  solve  continuous  optimization  problems  by  Corana  et  al. 
[24]. 
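The acceptance mechanism can be illustrated with the short sketch below, written for a continuous one-dimensional minimization problem. The Metropolis acceptance rule (improvements are always accepted, worse neighbors with probability exp(-increase / temperature)), the geometric cooling schedule, the Gaussian neighborhood, and the test function `f` are all assumptions of this sketch rather than details taken from [50] or [24].

```python
import math
import random


def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.999, n_iters=5000, seed=0):
    """Minimize f starting from x0; returns the best point visited and its value."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    temp = t0
    for _ in range(n_iters):
        y = x + rng.gauss(0.0, step)  # randomly visit a neighbor of the current solution
        fy = f(y)
        # Better neighbors are always accepted; worse ones are accepted with a
        # probability that shrinks as the temperature decreases.
        if fy <= fx or rng.random() < math.exp(-(fy - fx) / temp):
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling  # geometric cooling schedule (assumed)
    return best_x, best_f


def f(x):
    """Hypothetical multi-extremal objective with global minimum f(0) = 0."""
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))


x0 = 4.0
best_x, best_f = simulated_annealing(f, x0)
print(best_x, best_f)
```

At high temperature the search moves almost freely between basins of attraction; as the temperature drops it behaves more and more like a pure local search.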

Genetic  algorithms  are  inspired  by  natural  selection  and  survival  of  the  fittest  in 
the  biological  world.  In  GAs,  a  population  rather  than  a  single  solution  is  considered. 
Each iteration of the algorithm involves a “crossover” and a “mutation”, where promising
solutions  are  recombined  with  other  solutions  by  swapping  parts  of  a  solution  with  another, 
and  are  then  “mutated”  by  making  a  small  change  to  the  solution.  The  rationale  is  that 
recombination  and  mutation  may  give  rise  to  new  solutions  that  are  biased  towards  regions 
containing  good  solutions. 
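A bare-bones genetic algorithm on bit strings might look as follows. The truncation selection from the better half of the population, one-point crossover, bit-flip mutation, elitism, and the "count the ones" fitness function are illustrative choices made for this sketch, not the specific operators of [79].

```python
import random


def genetic_algorithm(fitness, n_bits=20, pop_size=30, n_gens=50, p_mut=0.02, seed=0):
    """Maximize fitness over bit strings; returns the best individual and a history
    of the best fitness value per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(n_gens):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]  # elitism: keep the current best unchanged
        while len(next_pop) < pop_size:
            # Select two promising parents from the better half of the population.
            p1, p2 = rng.sample(pop[:pop_size // 2], 2)
            cut = rng.randint(1, n_bits - 1)
            child = p1[:cut] + p2[cut:]  # crossover: swap parts of two solutions
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = genetic_algorithm(sum)  # fitness = number of ones (hypothetical)
print(history[0], history[-1], best)
```

Because the best individual is always carried over, the best fitness recorded in `history` never decreases from one generation to the next.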

The basic idea of tabu search is to record the search process, so that a search path
already visited can be avoided. This ensures that new regions of the solution space will be
investigated, with the goal of avoiding local minima and ultimately finding the desired
solution.

The nested partitions method systematically partitions the solution space into smaller
subregions, assesses the potential of each region based on random sampling, and concentrates
the computational effort in the most promising region. This is done repeatedly
until some of the regions are singleton sets (i.e., containing only one solution). In some
sense, this is equivalent to changing the underlying sampling distribution in that more
promising solutions will have larger chances of being selected. The algorithm is shown to
converge to a global optimal solution with probability one.

1.2.2  Model-based  Methods 

Model-based search methods are a class of solution techniques that were
introduced only in recent years. In model-based algorithms, new solutions are generated
via an intermediate probabilistic model that is updated or induced from the previously
generated solutions. So there is only an implicit/indirect dependency among the solutions
generated at successive iterations of the algorithm. In general, most of the algorithms
that fall in this category share a similar framework and usually involve the following two
phases:

1. Generate candidate solutions (random samples, trajectories) according to a specified
probabilistic model (e.g., a parameterized probability distribution on the solution
space).

2. Update the probabilistic model, on the basis of the data collected in the previous
step, in order to bias the future search toward “better” solutions.

Figure 1.1: Optimization via model-based methods

To illustrate how model-based methods work, we consider, in Figure 1.1, maximizing
a one-dimensional multi-extremal function $H(x)$ whose global optimum is achieved
at $x = 0$. Model-based methods approach this problem by initially casting a probability
model (distribution) over the solution space (the solid curve in Figure 1.1). This
initial distribution is then used to generate candidate solutions/samples; the performance
of these samples is evaluated and then used to update the initial distribution to
obtain a new distribution (the dashed curve in the figure). The preceding procedure is
performed repeatedly until some stopping criterion is satisfied. The underlying idea is that
if these probabilistic models are updated in an appropriate way, then the sequence of
samples/candidate solutions generated will become more and more concentrated near the
optimum.

Some well established techniques that belong to the model-based methods are the
cross-entropy (CE) method ([26], [58], [65], [66], [67], [68]), a class of algorithms called the
estimation of distribution algorithms (EDAs) ([53], [59], [60]), and the so-called annealing
adaptive search (AAS) ([74], [89]). The CE method was motivated by an adaptive algorithm
for estimating probabilities of rare events in complex stochastic networks ([65]),
which involves variance minimization. It was soon realized ([66], [67]) that the method
can be modified to solve combinatorial and continuous optimization problems. The CE
method usually starts with a family of parameterized probability distributions on the solution
space and tries to find the parameter of the distribution that assigns maximum
probability to the set of optimal solutions. Implicit in CE is an optimal reference distribution
concentrated only on the set of optimal solutions (i.e., zero variance). The key idea
of CE is to use an iterative scheme to successively estimate the optimal parameter that
minimizes the KL-divergence between the optimal reference distribution and the family
of parameterized distributions. The literature analyzing the convergence properties of the
CE method is relatively sparse. In the context of estimation of rare event probabilities,
Homem-de-Mello ([41]) shows the convergence of a variational version of CE to an estimate
of the optimal (possibly local) CE parameter with probability one. Rubinstein
([66]) shows the probability one convergence of the CE method to the optimal solution for
combinatorial optimization problems.
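The following sketch shows a simple CE-style iteration for maximizing a one-dimensional multi-extremal function. It assumes a Gaussian parameterized family, for which the KL-divergence minimization over the elite samples reduces to taking their sample mean and standard deviation. The objective `H`, the sample size, the elite fraction and the initial parameters are hypothetical, and refinements such as parameter smoothing are omitted.

```python
import math
import random


def H(x):
    """Hypothetical multi-extremal objective with global maximum H(0) = 1."""
    return math.exp(-x * x / 8.0) * math.cos(3.0 * x)


def cross_entropy_maximize(H, mu=3.0, sigma=3.0, n_samples=1000,
                           elite_frac=0.1, n_iters=30, seed=0):
    rng = random.Random(seed)
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iters):
        # Phase 1: generate candidate solutions from the current model N(mu, sigma^2).
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # Phase 2: update the model parameters from the best-performing samples.
        elite = sorted(samples, key=H, reverse=True)[:n_elite]
        mu = sum(elite) / n_elite
        sigma = math.sqrt(sum((x - mu) ** 2 for x in elite) / n_elite)
        if sigma < 1e-8:  # the model has essentially degenerated to a point mass
            break
    return mu, sigma


mu_hat, sigma_hat = cross_entropy_maximize(H)
print(mu_hat, sigma_hat)
```

The two phases of the loop correspond to the generic sampling and updating phases of model-based methods: the sampling distribution starts out wide and progressively concentrates around the region where the elite samples are found.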

The estimation of distribution algorithm (EDA) was first introduced in the field of
evolutionary computation in [59]. It inherits the spirit of the well-known genetic algorithms
(GAs), but eliminates the crossover and the mutation operators in order to avoid
the disruption of partial solutions. In EDAs, a new population of candidate solutions is
generated according to the probability distribution induced or estimated from the promising
solutions selected from the previous generation. Unlike CE, EDA often takes into
account the interrelations between the underlying decision variables needed to represent
the individual candidate solutions. At each iteration of the algorithm, a high-dimensional
probabilistic model that better represents the interdependencies between the decision variables
is induced; this step constitutes the most crucial and difficult part of the method.
We refer the reader to [53] for a review of the way in which different probabilistic models
are used as EDA instantiations. The convergence of a class of EDAs, under the infinite
population assumption, to the global optimum can be found in [90].

In annealing adaptive search (AAS) (cf., e.g., [89]), there is a sequence of distributions
called Boltzmann distributions, each parameterized by a temperature parameter
$T$. One salient property of the Boltzmann distribution is that when $T$ decreases to 0, the
sequence of Boltzmann distributions will converge to a degenerate distribution concentrated
only on the optimum. So the idea behind AAS is that if we can repeatedly sample
from the Boltzmann distribution as the temperature parameter gradually decreases to
0, then the candidate solutions/samples generated will converge to the global optimum.
However, sampling from the Boltzmann distribution is extremely difficult if not impossible,
since the distribution depends on the objective function itself. Thus in AAS, the research
and computational efforts have mostly centered around the issue of how to efficiently generate
samples. Currently, one popular and successful sampling approach is via the use
of Markov Chain Monte Carlo (MCMC) [89], but the distribution of the samples generated
according to MCMC can only be guaranteed to converge to the true Boltzmann
distribution in an asymptotic sense ([89]).
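The concentration property of the Boltzmann distribution is easy to observe numerically on a small finite solution set, where the distribution can be computed explicitly. The sketch below adopts a minimization convention, with probabilities proportional to exp(-cost / T); the four cost values and the temperature schedule are hypothetical.

```python
import math


def boltzmann(costs, temperature):
    """Boltzmann probabilities proportional to exp(-cost / temperature)."""
    shift = min(costs)  # subtracting the minimum avoids numerical overflow
    weights = [math.exp(-(c - shift) / temperature) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


costs = [3.0, 1.0, 2.5, 1.2]  # the optimum is the solution with index 1
temperatures = [10.0, 1.0, 0.1, 0.01]
mass_on_optimum = [boltzmann(costs, T)[1] for T in temperatures]
print(mass_on_optimum)
```

As the temperature decreases, the probability assigned to the optimal solution increases toward one. In realistic problems the normalizing constant cannot be computed in this way, which is precisely why sampling from the Boltzmann distribution is the main difficulty in AAS.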
1.3 Research Contributions


The  main  contributions  of  this  thesis  are  as  follows: 

• We have developed a simulation-based multistage sampling algorithm for solving
finite horizon MDPs. The algorithm is motivated by the computational challenges
arising from settings where some of the parameters of the MDP models are either
unknown or cannot be obtained in a feasible way. We have assumed that the underlying
system can be simulated and proposed to use multi-armed bandit models as
efficient tools to capture the tradeoff between sampling a promising action repeatedly
and exploring further other actions that might yield even greater benefit, so
that computational resources can be efficiently allocated in an adaptive manner as
the sampling process proceeds. We have studied the convergence properties (including
rate and complexity) of the algorithm and reported on computational results
to illustrate its performance. This work has been published in Operations Research
[22].

• Our second contribution complements the aforementioned state space reduction
techniques (cf. Section 1.1.1) and focuses on the issue of large action spaces. In particular,
we have proposed a novel algorithm that uses evolutionary, population-based
approaches to directly search the policy space in order to avoid carrying out an
optimization over the entire action space. We have established the convergence of
the algorithm for MDPs with finite state space but general (Borel) action spaces and
compared the performance of the algorithm with those of the existing techniques.
Preliminary empirical results on a queueing example indicated that the proposed
method may significantly reduce the computational effort of the classical PI algorithm.
A slightly different version of this work has been accepted for publication in the
INFORMS Journal on Computing [45].

• We have also proposed a new general framework called Model Reference Adaptive
Search (MRAS) for solving global optimization problems, which addresses the most
common computational difficulties faced by many model-based methods. We have
provided a particular instantiation of the framework and analyzed its global convergence
properties. We have studied some of the important properties of the recently
proposed CE method and showed that the CE method can actually be interpreted
as an instance of the proposed framework. We have also carried out detailed numerical
studies to demonstrate the effectiveness of the method and compared its
performance with those of CE and SA. This work has been accepted for publication
in Operations Research [46]; a preliminary version of this work was presented at the
2005 Genetic and Evolutionary Computation Conference (GECCO) [43].

• We have extended the MRAS framework to stochastic global optimization problems,
derived a set of sufficient conditions to ensure the global convergence of the method,
and tested the approach on several benchmark problems such as an (s, S) inventory
control problem and optimal buffer allocation problems in unreliable production
lines. This work has been submitted for publication [47]; a much abbreviated version
appeared in the 2005 Winter Simulation Conference proceedings [44].

The  rest  of  this  thesis  is  structured  as  follows. 

Chapter 2 provides some necessary background on MDPs and global optimization.
Specifically, Chapter 2.1 gives the formal definition of the MDP model and presents the
two classical approaches, value iteration (VI) and policy iteration (PI), for solving the
model. Chapter 2.2 briefly describes two of the recently proposed model-based methods
for solving global optimization problems, with an emphasis on the cross-entropy (CE) method, which
will be our starting point for deriving the results of Chapter 5 and Chapter 6.

In Chapter 3, we introduce a simulation-based algorithm called Adaptive Multistage
Sampling (AMS) for solving finite horizon MDPs with finite state and action spaces.
The  algorithmic  procedure  is  described  in  Chapter  3.2.  The  detailed  convergence  analysis 
is  given  in  Chapter  3.3.  In  Chapter  3.4,  we  perform  computational  experiments  on  a 
set  of  inventory  control  problems,  provide  two  additional  estimators,  and  discuss  the 
performance  of  different  estimators. 

In  Chapter  4,  we  propose  a  novel  algorithm  for  solving  a  class  of  problems  where  the 
state  space  is  relatively  small  but  the  action  space  is  large  or  uncountable.  The  chapter 
contains  a  detailed  description  of  the  proposed  algorithm  in  Chapter  4.3,  a  theoretical 
convergence  proof  of  the  algorithm  in  Chapter  4.4,  and  some  preliminary  empirical  results 
in Chapter 4.6. In the course of the discussion, an adaptive version of the proposed algorithm is
also considered in Chapter 4.5.

In Chapter 5, we propose a new model-based framework for solving global optimization
problems. A specific instantiation of the framework, in its deterministic version, is presented
and its convergence properties are established in Chapter 5.3, whereas the corresponding
Monte Carlo version of the method is described and its convergence proved in
Chapter 5.5. We explore the relationship between the CE method and the proposed framework
in Chapter 5.4. Preliminary numerical studies are also carried out in Chapter 5.6 to
demonstrate the effectiveness of the method.

Chapter 6 summarizes our initial ideas on adapting the MRAS framework to stochastic
domains. In particular, we provide a variational extension of the MRAS method
in Chapter 6.3, prove its global convergence in Chapter 6.4, and carry out numerical
experiments in Chapter 6.5 to verify the theoretical findings.

Finally, we conclude the thesis in Chapter 7 with a summary of the work done, a
discussion of the unresolved open issues, and an outline of some possible future research
topics.

Chapter 2
Preliminaries

2.1  Markov  Decision  Processes 

The MDP model can be formally described by a five-tuple
$M = (X, A, \{P_t, t = 0, 1, \ldots\}, \{R_t, t = 0, 1, \ldots\}, \alpha)$, where

• $X$ is a finite set of states of the environment.

• $A$ is a general action space.

• $\{P_t, t = 0, 1, \ldots\}$ is a sequence of state transition matrices, each of which maps a state-action
pair to a probability distribution over the state space $X$. At time $t$, the probability
of transitioning to state $y \in X$, given that we are in state $x \in X$ taking action $a \in A$,
is denoted by $P_{x,y|t}(a)$, i.e., the $(x, y)$th entry of $P_t$.

• $\{R_t, t = 0, 1, \ldots\}$ is a sequence of bounded non-negative one-stage cost functions,
where at time $t$, $R_t : X \times A \to \mathbb{R}^+ \cup \{0\}$.

• $\alpha \in (0, 1]$ is a discount factor.

Let $x_t$, $t = 0, 1, \ldots$, a random variable taking its values in $X$, be the state of the system
at time $t$. A decision rule or policy is a sequence of functions $\pi := \{\pi_t, t = 0, 1, \ldots\}$ with
each $\pi_t : X \to A$ specifying the action $\pi_t(x)$ taken when in state $x_t = x \in X$ at time
$t$. Such a policy is called stationary if all its components are independent of time, i.e., it
takes the form $\pi := \{\pi, \pi, \ldots\}$; for notational brevity, we simply denote it by $\pi$.

For a given horizon length $T > 0$, a given policy $\pi = \{\pi_t, t = 0, 1, \ldots, T - 1\}$ and
an initial state $x_0$, a particular system path that the decision maker follows is given by a
sequence of states and actions $\{x_0, \pi_0(x_0), \ldots, x_t, \pi_t(x_t), x_{t+1}, \pi_{t+1}(x_{t+1}), \ldots\}$, where the
transition from $x_t$ to $x_{t+1}$ is determined by the probability $P_{x_t,x_{t+1}|t}(\pi_t(x_t))$. Thus, the
probability of taking this particular path can be calculated as
$\prod_{t=0}^{T-1} P_{x_t,x_{t+1}|t}(\pi_t(x_t))$, and
the corresponding accumulated total cost can also be expressed as
$\sum_{t=0}^{T-1} \alpha^t R_t(x_t, \pi_t(x_t))$.
Thus, under the discounted cost criterion, which will be the primary focus of this research,
the expected total accumulated cost over all possible sample paths associated with $\pi$ can
be expressed as


J^{x)  =  E 


T-l 

Rt{xt)  +  ^  a^Rtixt,  TTtixt))  I  Xo  =  X 
t=o 


X  G  cy  G  (0, 1] . 


If the horizon length $T = \infty$, we assume that both the transition probability $P$ and the one-stage cost function $R$ are stationary, i.e., they do not change with time $t$. We therefore drop the explicit dependence on $t$ in both $P$ and $R$, and write the expected total discounted cost over an infinite horizon as

\[ J^\pi(x) = E\left[ \sum_{t=0}^{\infty} \alpha^t R(x_t, \pi_t(x_t)) \,\Big|\, x_0 = x \right], \quad x \in X,\ \alpha \in (0, 1), \]

where we require $\alpha$ to be strictly less than 1 in this case.

In both cases, we let $J^*(x)$ be the optimal cost function starting from an initial state $x$, defined by

\[ J^*(x) = \inf_{\pi} J^\pi(x), \quad x \in X. \tag{2.1} \]

We also call a stationary policy $\pi$ optimal if $J^\pi(x) = J^*(x)$ $\forall\, x \in X$.

For finite horizon problems, i.e., $T < \infty$, it is well known that the optimal cost $J^*(x)$ can be obtained via the following recursion (cf., e.g., [13] Vol. II).

Theorem 2.1.1 For every initial state $x \in X$, the optimal cost $J^*(x)$ is equal to $J_0(x)$, given by the last step of the following algorithm, which proceeds backward in time from stage $T-1$ to stage 0:

\[ J_T(x) = R_T(x), \quad \forall\, x \in X, \]

\[ J_t(x) = \min_{a_t \in A} \Big[ R_t(x, a_t) + \alpha \sum_{y \in X} P_{x,y|t}(a_t) J_{t+1}(y) \Big], \quad \forall\, x \in X,\ t = 0, \ldots, T-1. \tag{2.2} \]

Furthermore, if $a_t^* = \pi_t^*(x)$ minimizes the right-hand side of equation (2.2) for each $x$ and $t$, the policy $\pi^* = \{\pi_0^*, \ldots, \pi_{T-1}^*\}$ is optimal.
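
As an illustration, the backward recursion (2.2) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and not part of the original development: states and actions are indexed by integers, the array layout `P[t][a][x][y]`, `R[t][x][a]`, `R_T[x]` and the function name `backward_induction` are our own hypothetical choices, and the two-state example data are made up.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def backward_induction(
    P: List[List[Matrix]],   # P[t][a][x][y]: transition probability at stage t (assumed layout)
    R: List[Matrix],         # R[t][x][a]: one-stage cost at stage t (assumed layout)
    R_T: List[float],        # R_T[x]: terminal cost
    alpha: float,            # discount factor in (0, 1]
) -> Tuple[List[float], List[List[int]]]:
    """Run recursion (2.2) backward from stage T-1 to stage 0.

    Returns J_0 and the minimizing actions policy[t][x].
    """
    T, n = len(R), len(R_T)
    J = list(R_T)                                  # J_T(x) = R_T(x)
    policy: List[List[int]] = [[] for _ in range(T)]
    for t in range(T - 1, -1, -1):
        J_new, pi_t = [], []
        for x in range(n):
            q = [
                R[t][x][a] + alpha * sum(P[t][a][x][y] * J[y] for y in range(n))
                for a in range(len(R[t][x]))
            ]
            a_best = min(range(len(q)), key=q.__getitem__)
            J_new.append(q[a_best])
            pi_t.append(a_best)
        J, policy[t] = J_new, pi_t
    return J, policy


# Hypothetical example: action 0 keeps the state, action 1 switches it.
stay = [[1.0, 0.0], [0.0, 1.0]]
switch = [[0.0, 1.0], [1.0, 0.0]]
costs = [[1.0, 0.5], [2.0, 3.5]]                   # costs[x][a]
J0, policy = backward_induction([[stay, switch]] * 2, [costs] * 2, [0.0, 0.0], 0.5)
print(J0, policy)
```

In this toy instance with $T = 2$, the second-stage decision in state 0 is to switch (cost 0.5), whereas at stage 0 staying is cheaper once the discounted cost-to-go is taken into account, which illustrates why finite horizon policies generally depend on the stage.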

For infinite horizon problems, i.e., $T = \infty$, the optimal cost function $J^*$ satisfies the following Bellman’s optimality equation, which is essentially a stationary counterpart of equation (2.2).

Theorem 2.1.2 Under the bounded cost assumption and $\alpha \in (0, 1)$, the optimal cost $J^*$ satisfies

\[ J^*(x) = \min_{a \in A} \Big[ R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) J^*(y) \Big]. \tag{2.3} \]

Note that for simplicity, we have assumed in equations (2.2) and (2.3) that all actions in $A$ are admissible for each state in $X$.

The following proposition implies the existence of a stationary optimal policy when the minimum on the right-hand side of Bellman’s equation is attained for all $x \in X$.

Proposition 2.1.1 A stationary policy $\pi$ is optimal if and only if $\pi(x)$ attains the minimum in Bellman’s equation (2.3) for all $x \in X$.

Note that when the action space $A$ is finite, a stationary optimal policy is guaranteed to exist. On the other hand, when $A$ is infinite, we can also ensure the existence of such a policy by imposing some regularity assumptions on $A$, $P$, and $R$ such that the minimum in equation (2.3) is attained. For ease of exposition, we will simply assume that a stationary optimal policy for problem (2.1) always exists in the infinite horizon setting.

We now briefly describe the two most basic approaches for solving Bellman’s equation in an infinite horizon setting: value iteration (VI) and policy iteration (PI). Detailed discussions can be found in [13] and [63].

2.1.1  Value  Iteration 

VI is essentially the dynamic programming (DP) algorithm and is a principal method for computing the optimal value function $J^*$. It starts with an arbitrary bounded function $J_0(x)$, $\forall\, x \in X$, and computes at each iteration $k = 0, 1, \ldots$ a new function $J_{k+1}(x)$, $\forall\, x \in X$, from the old function $J_k(x)$ according to

\[ J_{k+1}(x) = \min_{a \in A} \Big[ R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) J_k(y) \Big], \quad \forall\, x \in X. \tag{2.4} \]

It is well known that under some mild regularity assumptions, the sequence of functions $\{J_k,\ k = 0, 1, \ldots\}$ generated will converge to the optimal value function, i.e., $\lim_{k \to \infty} J_k(x) = J^*(x)$ $\forall\, x \in X$ (cf., e.g., [13] and [63]). VI will generally require an infinite number of iterations to compute the optimal value function; however, in practice, the algorithm can often be strengthened by the use of error bounds. It can be shown (cf. [13] and [63]) that for a predetermined tolerance $\varepsilon > 0$, if $|J_{k+1}(x) - J_k(x)| < \varepsilon$ $\forall\, x \in X$ for some $k$, then the value function corresponding to the greedy policy $\pi^k$ that attains the minimum in the $k$th iteration of equation (2.4) cannot be too “far away” from the optimal value function $J^*$, in the sense that

\[ \max_{x \in X} \big| J^{\pi^k}(x) - J^*(x) \big| \le 2\varepsilon \frac{\alpha}{1 - \alpha}. \]

The above error bound often provides a useful guideline for terminating the VI algorithm.
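
The iteration (2.4) together with the stopping rule above can be sketched as follows. This is only an illustrative sketch: the integer indexing of states and actions, the layout `P[a][x][y]` and `R[x][a]`, the zero initial function, the function name `value_iteration`, and the small example are all our own assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def value_iteration(
    P: List[Matrix],      # P[a][x][y]: stationary transition probabilities (assumed layout)
    R: Matrix,            # R[x][a]: stationary one-stage costs (assumed layout)
    alpha: float,         # discount factor, strictly less than 1
    eps: float,           # tolerance used in the stopping rule
    max_iter: int = 100_000,
) -> Tuple[List[float], List[int]]:
    """Iterate (2.4) until max_x |J_{k+1}(x) - J_k(x)| < eps.

    Returns J_{k+1} and the greedy policy attaining the minimum at iteration k.
    """
    n = len(R)
    J = [0.0] * n                                   # arbitrary bounded starting function
    for _ in range(max_iter):
        Q = [
            [R[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
             for a in range(len(R[x]))]
            for x in range(n)
        ]
        J_new = [min(q) for q in Q]
        greedy = [min(range(len(q)), key=q.__getitem__) for q in Q]
        if max(abs(J_new[x] - J[x]) for x in range(n)) < eps:
            return J_new, greedy
        J = J_new
    raise RuntimeError("value iteration did not reach the tolerance")


# Hypothetical example: action 0 keeps the state, action 1 switches it.
stay = [[1.0, 0.0], [0.0, 1.0]]
switch = [[0.0, 1.0], [1.0, 0.0]]
costs = [[1.0, 0.5], [2.0, 3.5]]                    # costs[x][a]
J, greedy = value_iteration([stay, switch], costs, alpha=0.5, eps=1e-8)
print(J, greedy)
```

For this example the fixed point of (2.3) is $J^* = (2, 4)$, attained by always staying, so the printed values should agree with it up to the tolerance.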
2.1.2  Policy  Iteration

As an alternative to VI, PI starts with an arbitrary initial stationary policy $\pi^0$ and generates, one at each iteration, a sequence of stationary policies $\{\pi^0, \pi^1, \pi^2, \ldots\}$. At each iteration $k = 0, 1, \ldots$, the following two steps are fundamental:

1. Policy evaluation step, which evaluates the value function $J^{\pi^k}$ associated with the current (stationary) policy $\pi^k$:

\[ J^{\pi^k}(x) = R(x, \pi^k(x)) + \alpha \sum_{y \in X} P_{x,y}(\pi^k(x)) J^{\pi^k}(y), \quad \forall\, x \in X. \tag{2.5} \]

2. Policy improvement step, which computes a new improved policy as

\[ \pi^{k+1}(x) = \arg\min_{a \in A} \Big[ R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) J^{\pi^k}(y) \Big], \quad \forall\, x \in X. \tag{2.6} \]

It can be shown that the sequence of value functions has the following (monotonicity) property: $J^{\pi^0}(x) \ge J^{\pi^1}(x) \ge \cdots \ge J^*(x)$ $\forall\, x \in X$. Thus the sequence of policies $\{\pi^k,\ k = 0, 1, \ldots\}$ generated by PI is improving. Note that the total number of stationary policies is finite whenever the action space is finite. In this particular case, we will have $J^{\pi^{k+1}}(x) = J^{\pi^k}(x)$ $\forall\, x \in X$ for some finite $k$, which implies that PI obtains an optimal policy $\pi^*$ in a finite number of iterations. For relatively small problems (the size of the state space is
less  than  10^),  policy  iteration  is  generally  regarded  as  the  fastest  method  for  computing 
the  optimal  value  function  and  the  associated  optimal  policy,  provided  that  the  discount 
factor  is  sufficiently  large  [70] . 
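
The two steps (2.5) and (2.6) can be sketched as follows. This is a hedged illustration, not a reference implementation: the layout `P[a][x][y]`, `R[x][a]` and all function names are our own assumptions, and, to keep the sketch free of external libraries, the linear system (2.5) is solved approximately by successive approximation instead of a direct linear solve.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def evaluate_policy(P: List[Matrix], R: Matrix, alpha: float,
                    pi: List[int], tol: float = 1e-12) -> List[float]:
    """Approximate the solution of (2.5) by fixed-point iteration (an assumed substitute
    for a direct linear solve; it converges because alpha < 1)."""
    n = len(R)
    J = [0.0] * n
    while True:
        J_new = [
            R[x][pi[x]] + alpha * sum(P[pi[x]][x][y] * J[y] for y in range(n))
            for x in range(n)
        ]
        if max(abs(u - v) for u, v in zip(J_new, J)) < tol:
            return J_new
        J = J_new


def policy_iteration(P: List[Matrix], R: Matrix, alpha: float,
                     pi0: List[int]) -> Tuple[List[float], List[int]]:
    """Alternate policy evaluation (2.5) and policy improvement (2.6)."""
    n = len(R)
    pi = list(pi0)
    while True:
        J = evaluate_policy(P, R, alpha, pi)
        new_pi = []
        for x in range(n):
            q = [R[x][a] + alpha * sum(P[a][x][y] * J[y] for y in range(n))
                 for a in range(len(R[x]))]
            a_best = min(range(len(q)), key=q.__getitem__)
            # Keep the current action unless another is better by a clear margin,
            # so that ties and round-off cannot cause cycling.
            new_pi.append(a_best if q[a_best] < q[pi[x]] - 1e-9 else pi[x])
        if new_pi == pi:
            return J, pi
        pi = new_pi


# Hypothetical example: action 0 keeps the state, action 1 switches it.
stay = [[1.0, 0.0], [0.0, 1.0]]
switch = [[0.0, 1.0], [1.0, 0.0]]
costs = [[1.0, 0.5], [2.0, 3.5]]                    # costs[x][a]
J, pi = policy_iteration([stay, switch], costs, alpha=0.5, pi0=[1, 1])
print(J, pi)
```

Starting from the policy that always switches, whose value function is $(3, 5)$, a single improvement step already yields the policy that always stays, whose value function is $(2, 4)$; the second pass detects no further change and stops.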


2.2  Global  Optimization 

We consider the following optimization problem

\[ x^* \in \arg\max_{x \in \mathcal{X}} H(x), \quad x \in \mathcal{X} \subseteq \Re^n, \tag{2.7} \]

where $\mathcal{X}$ is the solution space, and $H(\cdot) : \mathcal{X} \to \Re^+ \cup \{0\}$. We assume that the feasible region $\mathcal{X}$ is unconstrained (i.e., $\mathcal{X} = \Re^n$) or is subject to relatively simple constraints, so that random sampling can be done easily on it; for instance, $\mathcal{X}$ is a finite set of alternatives or is of the form $[a_1, b_1] \times [a_2, b_2] \times \cdots \times [a_n, b_n]$.

In this chapter, we review the cross-entropy (CE) method, estimation of distribution algorithms (EDAs), and annealing adaptive search (AAS) for solving (2.7). As mentioned in Chapter 1.2, they all fall within the framework of model-based methods. One of the most important features of a model-based approach is its ability to learn and adapt during the search process. Initially, the approach starts from a global perspective, and gathers information about the “gross behavior” of the objective function by random sampling of the entire feasible region $\mathcal{X}$. As finer details of the objective function are revealed, the search (random sampling) becomes increasingly concentrated on subregions of $\mathcal{X}$ containing high-quality solutions. In a nutshell, this learning process consists of the following two steps:

1.  Generating candidate solutions according to some parameterized probabilistic model.

2.  Modifying the parameters of the model by using the candidate solutions in order to bias future sampling toward high-quality solutions.

Thus, two crucial ingredients of any model-based approach are: (1) a probabilistic model that allows efficient generation of candidate solutions; (2) an efficient rule for updating the parameters of the model.

2.2.1  The  Cross-Entropy  Method 

The CE method starts with a family of parameterized probability density/mass functions $\{f(\cdot; \theta) : \theta \in \Theta\}$ over $\mathcal{X}$, where $\Theta$ is the parameter space. Instead of directly solving (2.7), the algorithm tries to solve the following estimation problem

\[ \ell(\gamma) = P_\theta(H(X) \ge \gamma) = E_\theta\big[ I_{\{H(X) \ge \gamma\}} \big], \]

where $X$ is a random vector taking values in $\mathcal{X}$ with p.d.f./p.m.f. $f(\cdot; \theta)$, $\gamma$ is some parameter, and

\[ I_{\{H(x) \ge \gamma\}} = \begin{cases} 1 & \text{if } H(x) \ge \gamma, \\ 0 & \text{otherwise.} \end{cases} \]

Let us denote the maximum of (2.7) by $H^*$. The goal of CE is to find an optimal parameter $\theta^*$ so that the p.d.f./p.m.f. $f(\cdot; \theta^*)$ assigns maximum mass to the set of (near) optimal solutions $\{x : H(x) \ge H^*\}$. Once such a parameter is found, the resulting p.d.f./p.m.f. can be used to generate good candidate solutions to the optimization problem with high probability. However, if $\gamma$ is close to $H^*$, then typically $\{H(X) \ge \gamma\}$ is a rare event, and estimation of the probability $\ell(\gamma)$ is a nontrivial problem. The CE method breaks down this estimation problem into a sequence of simpler estimation problems and generates a sequence of tuples $\{(\gamma_k, \theta_k),\ k = 0, 1, \ldots\}$, which converges (empirically) quickly to a small neighborhood of the optimal tuple $(\gamma^*, \theta^*)$. The main CE optimization algorithm is summarized as follows.

Algorithm 2.2.1 (Main CE Algorithm for Optimization) Let $\rho \in (0, 1)$ be the fraction of the best samples that will be used in parameter updating, and $N$ be the number of samples at each iteration.

1. Choose the initial parameter $\theta_0 \in \Theta$. Set the iteration counter $k = 0$.

2. Draw random samples $X_k^1, \ldots, X_k^N$ according to $f(\cdot; \theta_k)$. Calculate the sample $(1-\rho)$-quantile by ordering $H(X_k^i)$, $i = 1, \ldots, N$, from smallest to largest and then setting $\gamma_k := H_{(\lceil (1-\rho) N \rceil)}$, where $H_{(i)}$ is the $i$th order statistic of the ordered sample performances and $\lceil a \rceil$ denotes $a$ rounded up to the nearest integer.

3. Calculate the new parameter $\theta_{k+1}$ by solving the optimization problem

\[ \theta_{k+1} := \arg\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} I_{\{H(X_k^i) \ge \gamma_k\}} \ln f(X_k^i; \theta). \]

4. If for some $k \ge d$, say $d = 5$,

\[ \gamma_k = \gamma_{k-1} = \cdots = \gamma_{k-d}, \]

then terminate; otherwise set $k = k + 1$ and reiterate from Step 2.
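
A minimal one-dimensional sketch of Algorithm 2.2.1 is given below for illustration. It rests on several assumptions of our own: the parameterized family is Gaussian with $\theta = (\mu, \sigma)$, for which the maximizer in Step 3 is the sample mean and (maximum likelihood) standard deviation of the elite samples; the exact equality test of Step 4 is relaxed to a tolerance `tol`, since exact ties rarely occur for continuous $H$; and the function name, default parameters, and test objective are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def cross_entropy_maximize(
    H: Callable[[float], float],   # nonnegative objective to maximize
    mu: float, sigma: float,       # initial Gaussian parameter theta_0 = (mu, sigma)
    N: int = 200,                  # samples per iteration
    rho: float = 0.1,              # fraction of best samples used in the update
    d: int = 5,                    # window length of the stopping rule
    tol: float = 1e-6,             # assumed relaxation of the equality test in Step 4
    max_iter: int = 200,
    seed: int = 0,
) -> Tuple[float, float, float]:
    rng = random.Random(seed)
    gammas: List[float] = []
    gamma = float("-inf")
    for _ in range(max_iter):
        # Step 2: sample and compute the sample (1 - rho)-quantile.
        xs = [rng.gauss(mu, sigma) for _ in range(N)]
        scores = sorted(H(x) for x in xs)
        gamma = scores[math.ceil((1 - rho) * N) - 1]   # order statistic, 1-indexed in the text
        # Step 3: for a Gaussian family the argmax is the elite mean and standard deviation.
        elite = [x for x in xs if H(x) >= gamma]
        mu = sum(elite) / len(elite)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in elite) / len(elite))
        # Step 4: stop once the last d + 1 quantiles agree up to tol.
        gammas.append(gamma)
        window = gammas[-(d + 1):]
        if len(gammas) > d and max(window) - min(window) <= tol:
            break
    return mu, sigma, gamma


# Hypothetical test objective with maximum H* = 1 at x = 2.
mu, sigma, gamma = cross_entropy_maximize(lambda x: math.exp(-(x - 2.0) ** 2), mu=0.0, sigma=5.0)
print(mu, sigma, gamma)
```

On this unimodal objective the sampling distribution should contract around the maximizer within a few iterations, with $\gamma_k$ approaching $H^* = 1$.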

The deterministic version of Algorithm 2.2.1 is presented below.

Algorithm 2.2.2 (Deterministic Version of the CE Method)

1. Choose the initial parameter $\theta_0 \in \Theta$. Set $k = 0$.

2. Calculate the $(1-\rho)$-quantile $\gamma_k$ as

\[ \gamma_k := \max\{ l : P_{\theta_k}(H(X) \ge l) \ge \rho \}. \]

3. Compute the new parameter by solving the following problem

\[ \theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\big[ I_{\{H(X) \ge \gamma_k\}} \ln f(X; \theta) \big]. \]

4. If for some $k \ge d$, say $d = 5$,

\[ \gamma_k = \gamma_{k-1} = \cdots = \gamma_{k-d}, \]

then terminate; otherwise set $k = k + 1$ and reiterate from Step 2.

2.2.2  The  Estimation  of  Distribution  Algorithms

EDAs were first introduced in the field of evolutionary computation. However, unlike classical evolutionary algorithms, they do not rely on “genetic” operators (e.g., the crossover and mutation mechanisms); instead, in each iteration, they build an explicit probabilistic model (probability distribution) of promising solutions in the search space. New candidate solutions are created by sampling from this distribution. At a general level, an EDA can be concisely described as follows.

Algorithm 2.2.3 (Estimation of Distribution Algorithm) Let $N$ be the size of the population at each iteration.

1. Generate the initial population $D_0$ ($N$ candidate solutions) randomly (e.g., uniformly) from the solution space. Set the iteration counter $k = 0$.

2. Construct a set of promising solutions $D_k^s$ by selecting $S \le N$ candidate solutions from $D_k$ according to a selection scheme.

3. Estimate $P_k(x) := P(x \mid D_k^s)$ for all $x \in \mathcal{X}$, i.e., the probability distribution of solution $x$ being among the selected solutions $D_k^s$.

4. Construct a new population $D_{k+1}$ by sampling $N$ candidate solutions from $P_k(x)$.

5. If a stopping criterion is met, then terminate; otherwise set $k = k + 1$ and reiterate from Step 2.
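
As a concrete instance of the five steps above, the following sketch uses a fully factorized (univariate) model on binary strings. Everything specific here is an assumption made for illustration: the objective is the number of ones in the string, the selection scheme is truncation selection, the stopping criterion is a fixed number of iterations, and the function name and default parameters are hypothetical.

```python
import random
from typing import List, Tuple


def umda_onemax(n: int = 20, N: int = 100, S: int = 30,
                iters: int = 50, seed: int = 0) -> Tuple[List[int], List[float]]:
    """EDA with independent Bernoulli marginals maximizing H(x) = number of ones in x."""
    rng = random.Random(seed)
    # Step 1: initial population drawn uniformly from {0, 1}^n.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(N)]
    p = [0.5] * n
    for _ in range(iters):
        # Step 2: truncation selection keeps the S best candidates.
        selected = sorted(pop, key=sum, reverse=True)[:S]
        # Step 3: estimate the model; here one marginal frequency per variable.
        p = [sum(ind[j] for ind in selected) / S for j in range(n)]
        # Step 4: sample a new population from the estimated model.
        pop = [[1 if rng.random() < p[j] else 0 for j in range(n)] for _ in range(N)]
    return max(pop, key=sum), p


best, marginals = umda_onemax()
print(sum(best), marginals)
```

Because the model stores only one number per variable, each iteration is cheap, at the price of ignoring any dependence between variables.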

The performance of a particular EDA is mainly determined by the construction and estimation of the probabilistic model $P_k(\cdot)$. More accurate models generally lead to better performance of the algorithm; however, they are often more complicated and expensive to build. In combinatorial domains, if the random vector $X \in \mathcal{X}$ consists of $n$ discrete variables, i.e., $X = (X_1, X_2, \ldots, X_n)$, and each variable $X_i$ can take on $m$ values, then a complete description of the joint probability distribution of $X$ requires $m^n - 1$ parameters, and estimating all these parameters is clearly impractical for large $n$. In practice, in order to reduce the number of parameters used to represent the joint distribution, simplifying assumptions are made about the structure of the distribution. For instance, consider the case where $n = 3$ and $m = 3$. A precise description of the joint distribution requires 26 parameters: 2 for the distribution of $X_3$, 6 for the conditional distribution $P(X_2 = y \mid X_3 = z)$, and 18 for $P(X_1 = x \mid X_2 = y, X_3 = z)$. If we assume that given $X_2$, $X_1$ is independent of $X_3$, then only 14 parameters are required. Finally, if all variables are assumed to be independent, then the joint distribution of $X$ is determined by the univariate marginal distributions of $X_1$, $X_2$, and $X_3$, which in turn require only 6 parameters. Thus, there is often a tradeoff between accuracy and efficiency. When categorized by the complexity of the underlying probabilistic models employed, there are a number of different instantiations of EDAs, ranging from the simple Univariate Marginal Distribution Algorithm (UMDA) [59], where all components of an individual solution are assumed to be independent, to the Bayesian Optimization Algorithm (BOA), which uses Bayesian networks as the probabilistic model. Please refer to [53] and [60] for a review.
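
The parameter counts in the example above can be checked with a few lines of arithmetic. The three helper functions below are hypothetical names introduced only for this check; the chain-structured count assumes, as in the example, that each variable depends only on its successor.

```python
def full_joint_params(n: int, m: int) -> int:
    """Free parameters of an unrestricted joint distribution over m**n outcomes."""
    return m ** n - 1


def chain_params(n: int, m: int) -> int:
    """One marginal (m - 1 parameters) plus n - 1 conditionals with m * (m - 1) parameters each."""
    return (m - 1) + (n - 1) * m * (m - 1)


def independent_params(n: int, m: int) -> int:
    """n univariate marginals with m - 1 free parameters each."""
    return n * (m - 1)


print(full_joint_params(3, 3), chain_params(3, 3), independent_params(3, 3))  # 26 14 6
```

For larger problems the gap widens quickly: the unrestricted count grows exponentially in $n$, while the chain-structured and independent models grow only linearly.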

2.2.3  Annealing  Adaptive  Search 

The annealing adaptive search method was originally developed to understand the behavior of the classical simulated annealing algorithm. The method, in its idealized form, assumes that samples can be generated exactly from a sequence of Boltzmann distributions (in a maximization context) given by

\[ g_k(x) = \frac{e^{H(x)/T_k}}{\int_{\mathcal{X}} e^{H(x)/T_k}\, \nu(dx)}, \]

where $\nu$ is the Lebesgue or discrete measure on the solution space, and $T_k$ is the temperature parameter at the $k$th iteration, which is usually taken to be a function (cooling schedule) of the past samples/candidate solutions visited. The idealized version of AAS, taken from [89], is presented below.

Algorithm 2.2.4

1. Generate a solution $X_0$ uniformly from the solution space $\mathcal{X}$. Set $k = 0$, $Y_0 = H(X_0)$, $y^* = Y_0$, $x^* = X_0$, and $T_0 = \tau(y^*)$, where $\tau(\cdot)$ is a positive real-valued nondecreasing cooling schedule.

2. Generate $X_{k+1}$ from the Boltzmann distribution with temperature parameter $T_k$.

3. If $H(X_{k+1}) > Y_k$, set $Y_{k+1} = H(X_{k+1})$, $y^* = Y_{k+1}$, $x^* = X_{k+1}$, and $T_{k+1} = \tau(y^*)$. Otherwise, set $Y_{k+1} = Y_k$ and $T_{k+1} = T_k$.

4. Set $k = k + 1$ and return to Step 2 until some specified stopping rule is satisfied.
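
On a finite solution space the Boltzmann distribution used in Step 2 can be computed exactly, which makes the effect of the temperature easy to see. The sketch below is only an illustration under our own assumptions: the function name `boltzmann_pmf` and the three objective values are hypothetical, and the maximum is subtracted inside the exponential purely for numerical stability (it cancels in the ratio).

```python
import math
from typing import List


def boltzmann_pmf(values: List[float], T: float) -> List[float]:
    """Probability mass proportional to exp(H(x) / T) over a finite set of objective values."""
    shift = max(values)                       # cancels in the ratio, avoids overflow
    weights = [math.exp((v - shift) / T) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


H_values = [0.0, 1.0, 2.0]                    # hypothetical objective values of three solutions
warm = boltzmann_pmf(H_values, T=1.0)
cold = boltzmann_pmf(H_values, T=0.1)
print(warm, cold)
```

At temperature 1 the best solution receives roughly two thirds of the mass, whereas at temperature 0.1 it receives almost all of it, which is the mechanism by which a decreasing temperature concentrates the sampling on high-quality solutions.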

AAS has some attractive theoretical properties. For example, it is shown in [74] that for a particular cooling schedule of the temperature parameter, the expected number of improving samples/solutions (in terms of their performance) and the number of function evaluations both grow only linearly with the problem dimension. However, as noted earlier, in order to implement the method in practice, AAS needs to be used in conjunction with efficient sampling techniques. This is an active area that has received much attention both in the past and at present. Since the technical details are beyond the scope of this research, we refer interested readers to the work of [74] and [89].

Chapter 3

An Adaptive Multi-stage Sampling Algorithm for Solving Finite Horizon Markov Decision Processes

In this chapter, we propose a simulation-based framework for approximately solving general finite horizon MDPs with large state spaces. For a given MDP with horizon $T$, the method can be interpreted as an efficient search method for a decision tree with depth $T$, where each node of the tree represents a state, with the root node corresponding to an initial state, and each edge of the tree signifies a sampling of a given action. The method employs a depth-first search for generating sample paths from the initial state to the final state (i.e., when the finite horizon $T$ is reached) and uses backtracking to estimate the value functions at previously visited states, where the estimated value function of a certain node/state is taken to be the weighted average of the Q-values at the successive child nodes/states. We show that the estimated value function at the initial state produced by the algorithm not only converges to the true optimal value but also does so in an “efficient” way, with the worst-case bias bounded by a quantity that converges to zero at a rate of $O\big(\sum_{t=0}^{T-1} \frac{\ln N_t}{N_t}\big)$, where $N_t$ is the total number of samples that are used per state sampled in stage $t$. Given that the action space size is $|A|$, the worst-case running time complexity of the algorithm is $O\big((|A| \max_{t=0,\ldots,T-1} N_t)^T\big)$, which is independent of the state space size but depends on the size of the action space, owing to the requirement that each action be sampled at least once at each sampled state.

A similar sampling strategy (i.e., the recursive tree sampling structure) was previously used in [49] to create an on-line, near-optimal planning algorithm for solving large MDPs. However, their approach differs from ours in the way actions are sampled. Their method employs a straightforward nonadaptive sampling scheme, where each action is always sampled a prespecified fixed number of times. This scheme is generally suboptimal and can waste computational resources, especially when the computational budget is tight. Our method, in contrast, adaptively chooses which action to sample as the sampling process proceeds, and concentrates most of the sampling on the action with high variability, which could yield the most computational benefit in cases where sampling is relatively expensive.

The adaptive sampling idea in our approach originates from the expected regret analysis of the multi-armed bandit problem developed by [52]. In particular, we exploit the recent finite-time analysis work by [8] that elaborated on [1]. The objective of these problems is to play as often as possible the machine that yields the highest (expected) reward. The optimal strategy (policy) must balance between playing the machine that is empirically best thus far (exploitation), i.e., the machine that has the highest sample mean, and trying to find a better machine (exploration) that actually has a higher expectation but might have a lower sample mean thus far due to statistical variation. The expected loss due to not always playing the true optimal machine is called regret, which quantifies the exploration/exploitation dilemma in the search for the true (unknown in advance) “optimal” machine. Lai and Robbins [52] showed that for an optimal strategy the regret grows at least logarithmically in the number of machine plays, and recently Auer et al. [8] showed that the logarithmic regret is also achievable uniformly over time with a simple and efficient sampling algorithm for arbitrary reward distributions with bounded support. We incorporate their results into a sampling-based process for finding an optimal action in a state for a single stage of a finite horizon MDP by appropriately converting the definition of regret into the difference between the true optimal value and the approximate value yielded by the sampling process. We then extend the one-stage sampling process into multiple stages in a recursive manner, leading to a multi-stage (sampling-based) approximation algorithm for solving MDPs.

3.1  Related  Work 

Multi-armed bandit problems have been studied extensively for many years; however, the literature applying the theory of the multi-armed bandit problem to derive a provably convergent framework for solving general MDPs is scarce. The closest related work is probably that of Agrawal et al. [2], who considered a controlled finite-state/action-space Markov chain problem with an infinite horizon average reward criterion. In their setting, the transition probabilities and initial distribution are parameterized by an unknown parameter $\theta$ selected from some known finite parameter space, with each fixed parameter $\theta$ leading to an ergodic Markov chain. They assume that for each $\theta$, there exists a unique optimal stationary policy. They consider a finite-horizon loss function defined over all $\theta$’s based on the regret of [52], and regard the optimal stationary policy for the average reward as an approximation for an optimal nonstationary policy that minimizes the loss for the finite horizon. By then using the optimal stationary policy for the average reward for each $\theta$, they develop an adaptive but rather complex policy, the performance of which is bounded in terms of the horizon size of the loss function, which vanishes as the size increases. The adaptiveness comes from the use of the multi-armed bandit theory for the stationary control laws. In other words, an arm corresponds to a particular stationary law or policy, not to a particular action in the action space.

3.2  Adaptive  Sampling  Algorithm

3.2.1  Background

We consider the MDP problem $M = (X, A, \{P_t,\ t = 0, 1, \ldots\}, \{R_t,\ t = 0, 1, \ldots\}, \alpha)$ with finite horizon length $T$, finite state space $X$, finite action space $A$ with $|A| > 1$, and bounded non-negative one-stage cost function $R_t$. Again for simplicity (and without loss of generality), we assume that every action is admissible in every state.

At stage $t < T$, for a given state $x$, we define the optimal discounted reward-to-go at state $x$ from stage $t$ as

\[ J_t^*(x) = \sup_{\pi \in \Pi} E\left[ \sum_{i=t}^{T-1} \alpha^{i-t} R_i(x_i, \pi_i(x_i)) \,\Big|\, x_t = x \right], \quad x \in X,\ 0 < \alpha \le 1,\ t = 0, \ldots, T-1, \tag{3.1} \]

with $J_T^*(x) = 0$ for all $x \in X$, where $\Pi$ is the set of all possible nonstationary Markovian policies $\pi = \{\pi_t \mid \pi_t : X \to A,\ t \ge 0\}$; the assumption of a zero terminal reward function (made for simplicity) can be relaxed to an arbitrary terminal reward function. Our goal is to estimate, for a given initial state $x$, the optimal discounted total reward $J_0^*(x)$ (thereby obtaining an approximate optimal policy). As mentioned earlier, the objective of multi-armed bandit problems is to identify the machine that has the highest reward. Therefore, for ease of exposition, it is natural for us to consider in (3.1) a slightly different version of the MDP model introduced in Chapter 2, i.e., maximizing the reward instead of minimizing the cost. However, we remark that all results can be easily extended to a minimization context; we will come back to this issue later in Chapter 3.4.

By Theorem 2.1.1, the optimal reward-to-go $J_t^*(x)$ can be obtained recursively as follows: for all $x \in X$ and $t = 0, \ldots, T-1$,

\[ J_t^*(x) = \max_{a \in A} Q_t^*(x, a), \quad \text{where we define} \quad Q_t^*(x, a) = R_t(x, a) + \alpha \sum_{y \in X} P_{x,y|t}(a) J_{t+1}^*(y). \tag{3.2} \]

The right-hand side of (3.2) is the sum of the one-stage cost and the expected value of the future optimal cost-to-go; a natural way to estimate $Q_t^*(x, a)$ is therefore to use its sample average approximation $\hat{Q}_t(x, a)$ given by

\[ \hat{Q}_t(x, a) = R_t(x, a) + \alpha \frac{1}{N_{a,t}^x} \sum_{y \in S_a^x} \hat{J}_{t+1}^{N_{t+1}}(y), \tag{3.3} \]

where $S_a^x$ is the multiset (in which the same element may appear more than once) of independently sampled next states according to the transition probability $P_{x,\cdot|t}(a)$, $N_{a,t}^x := |S_a^x| \ge 1$ is the cardinality of this multiset, with $\sum_{a \in A} N_{a,t}^x = N_t$ for a fixed $N_t \ge |A|$ for all $x \in X$, and $\hat{J}_{t+1}^{N_{t+1}}(y)$ is an estimate of the optimal cost-to-go at the sampled next state $y$. Note that the number of next-state samples depends on the state $x$, action $a$, and stage $t$. If we further estimate the optimal value $J_t^*(x)$ by the weighted sum

\[ \hat{J}_t^{N_t}(x) = \sum_{a \in A} \frac{N_{a,t}^x}{N_t} \hat{Q}_t(x, a), \]

then we have the following recursive relationship

\[ \hat{J}_t^{N_t}(x) := \sum_{a \in A} \frac{N_{a,t}^x}{N_t} \Big( R_t(x, a) + \alpha \frac{1}{N_{a,t}^x} \sum_{y \in S_a^x} \hat{J}_{t+1}^{N_{t+1}}(y) \Big), \quad t = 0, \ldots, T-1, \]

with $\hat{J}_T^{N_T}(x) = J_T^*(x) = 0$ for all $x \in X$.

In the above definition, the total number of sampled (next) states is $O(N^T)$ with $N = \max_{t=0,\ldots,T-1} N_t$, which is independent of the state space size. To carry out the above recursion, we need to determine the value of $N_{a,t}^x$ for $t = 0, \ldots, T-1$, $a \in A$, and $x \in X$. An obvious way is to use the straightforward non-adaptive approach, with the same fixed value of $N_{a,t}^x$ for all $x$, $a$, and $t$. Here, however, we consider an adaptive allocation rule (sampling scheme); in particular, we want to adaptively choose the value of $N_{a,t}^x$ in such a way that the expected difference between $\hat{J}_0^{N_0}(x)$ and $J_0^*(x)$ is bounded as a function of $N_{a,t}^x$ and $N_t$, $t = 0, \ldots, T-1$, and the bound goes to zero as $N_t$ goes to infinity.

The  main  idea  behind  the  adaptive  allocation  rule  is  based  on  a  simple  interpre¬ 
tation  of  the  regret  analysis  of  the  multi-armed  bandit  problem,  where  plays  of  the  ith 
machine  (1  <  i  <  m,  m  is  the  total  number  of  machines)  yield  i.i.d.  random  rewards  with 
unknown  mean  and  the  goal  is  to  play  as  often  as  possible  the  machine  corresponding 
to  the  maximum  mean  fi* .  The  rewards  across  different  machines  are  also  assumed  to 
be  independently  generated.  Let  Ci{n)  be  the  number  of  times  the  ith  machine  has  been 
played  by  an  algorithm  during  the  first  n  plays.  We  define  the  expected  regret  p{n)  of  an 
algorithm  after  n  plays  by 

m 

p{n)  =  p*n  -  '^piE[Ci{n)\. 
i=l 

Lai and Robbins [52] characterized an "optimal" algorithm such that the best machine, which is associated with $\mu^*$, is played exponentially more often than any other machine, at least asymptotically. That is, they showed that playing machines according to an (asymptotically) optimal algorithm leads to $\rho(n) = \Theta(\ln n)$ as $n \to \infty$ under mild assumptions on the reward distributions. However, obtaining the optimal algorithm proposed by Lai and Robbins is often very difficult, so Agrawal [1] derived a set of simple algorithms that achieve the asymptotic logarithmic regret behavior, using a form of upper confidence bounds. The upper confidence bound gives a criterion for trading off exploitation against exploration at each play. For example, let $\bar n$ be the number of overall plays (for all machines) so far, and let $\hat\mu_i(\bar n)$ be the sample mean reward accumulated by playing machine $i$. During the plays, we are tempted to take the machine with the maximum current sample mean (exploitation). However, $\hat\mu_i(\bar n)$ is just an estimate of the true mean and may have high variability. Therefore, always playing the machine that yields the best current sample mean is not optimal in general; it is also desirable to play other machines occasionally (exploration). To account for the variability in the estimation, we try to find a function $\sigma_i(\bar n)$ such that the true mean $\mu_i$ falls in the confidence interval $(\hat\mu_i(\bar n) - \sigma_i(\bar n),\ \hat\mu_i(\bar n) + \sigma_i(\bar n))$ with high probability. Agrawal's algorithm chooses the machine with the highest upper confidence bound at each play. For bounded rewards, [8] propose simple upper confidence bound based algorithms that achieve the logarithmic regret uniformly over time, rather than only asymptotically, and our sampling algorithm primarily builds on their results.
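The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `ucb_choose` is hypothetical, and the confidence width $\sqrt{2 \ln \bar n / T_i}$ (with $T_i$ the number of plays of machine $i$) is assumed to be the UCB1-style width that appears later in (3.5).

```python
import math


def ucb_choose(sample_means, counts, n_bar):
    """Return the index of the machine with the largest upper confidence bound.

    sample_means[i] is the current sample mean reward of machine i,
    counts[i] is how often machine i has been played, and n_bar is the
    overall number of plays so far (assumed to equal sum(counts)).
    """
    # Any machine that has never been played is tried first.
    for i, c in enumerate(counts):
        if c == 0:
            return i
    # Index = sample mean (exploitation) + confidence width (exploration).
    indices = [
        m + math.sqrt(2.0 * math.log(n_bar) / c)
        for m, c in zip(sample_means, counts)
    ]
    return max(range(len(indices)), key=indices.__getitem__)
```

A machine with a slightly lower sample mean but very few plays receives a large width and is therefore explored, which is exactly the trade-off described in the paragraph above.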

To see how we incorporate the confidence bound idea into an adaptive allocation rule for finite horizon MDPs, we first consider only the one-stage problem (i.e., $T = 1$). For this problem, by definition we know the value of $J_1^*(x)$ for all $x \in X$, and our goal is to estimate $J_0^*(x)$. From (3.2), we need to obtain a viable estimate of $Q_0^*(x, a^*)$, where $a^* \in \arg\max_{a \in A} Q_0^*(x, a)$. The search for $a^*$ corresponds to the search for the best machine in the multi-armed bandit problem. We start by sampling each possible action once at $x$, which leads to the next state according to $P_{x,\cdot|0}(a)$ and reward $R_0(x, a)$. The next action to sample is the one that achieves the maximum among the current estimates of $Q_0^*(x, a)$ plus its current upper confidence bound (see (3.5)), where the estimate $\hat Q_0(x, a)$ is given by the immediate reward plus the sample mean of the $J_1^*$-values at the next states sampled so far (see (3.6)). The above procedure is repeated until a prespecified total sampling budget is consumed; see, in particular, the Loop step in Figure 3.1.
Given the total number of samples $N_0$ for state $x$ at the initial stage, $N_{a,0}^x$ denotes the number of times action $a$ has been sampled. If the sampling is done appropriately, we might expect that in the long run, the optimal actions will be sampled significantly more often than the non-optimal actions, so that $N_{a,0}^x / N_0$ provides a good estimate of the likelihood that action $a$ is optimal in state $x$. As a result, in the limit as $N_0 \to \infty$, we should have $\lim_{N_0 \to \infty} \sum_{a \in A^*} N_{a,0}^x / N_0 = 1$, where $A^*$ denotes the set of all optimal actions. For this reason, we use a weighted (by $N_{a,0}^x / N_0$) sum of the currently estimated values of $Q_0^*(x, a)$ over $A$ to approximate $J_0^*(x)$ (see (3.7)). As the weighted sum concentrates on $a^*$ as the sampling proceeds, the estimate $\hat J_0^{N_0}(x)$ converges to $J_0^*(x)$.

Remark 3.2.1 Throughout this chapter, the notation $O$ is used in the sense that for two given functions $f$ and $g$, $f(n) = O(g(n))$ if $\lim_{n \to \infty} f(n)/g(n) = c$ for some constant $c \ge 0$, and the notation $\Theta$ is used in the sense that there exist positive constants $c_1$, $c_2$, and $n_0$ such that $0 \le c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$ ([25]). The $O$- and $\Theta$-notations are often called asymptotic upper bound and asymptotically tight bound, respectively, for the asymptotic running time of an algorithm.

3.2.2  Algorithm  description 

The adaptive multi-stage sampling (AMS) algorithm is essentially a recursive extension of the one-stage sampling approach described in the preceding two paragraphs. The basic algorithmic procedure is given in Figure 3.1. The inputs to the algorithm are a state $x \in X$, the total number of samples $N_t \ge |A|$ allowed at stage $t$, and the stage $t$ itself; the output of the algorithm is $\hat J_t^{N_t}(x)$, an estimate of the true optimal reward-to-go $J_t^*(x)$ from state $x$. The AMS algorithm itself is recursively called to estimate $J_k^*(y)$ whenever we need to calculate the value $\hat J_k^{N_k}(y)$ for a state $y \in X$ at stage $k$ in the Initialization and Loop subroutines of the algorithm. In particular, we need to call AMS recursively at Equations (3.4) and (3.6). The initial call to AMS is done with $t = 0$ and the initial state $x_0$, and every sampling is done independently of the previous samplings.

Adaptive Multi-stage Sampling (AMS)

•  Input: a state $x \in X$, $N_t \ge |A|$, and stage $t$. Output: $\hat J_t^{N_t}(x)$.

•  Initialization: Sample each action $a \in A$ sequentially once at state $x$ and set

$$\hat Q_t(x, a) = \begin{cases} 0 & \text{if } t = T \text{ (go to Exit)}, \\ R_t(x, a) + \alpha \hat J_{t+1}^{N_{t+1}}(y) & \text{if } t \ne T, \end{cases} \qquad (3.4)$$

where $y$ is the sampled next state according to $P_{x,\cdot|t}(a)$, and set the total current number of samples $\bar n = |A|$.

•  Loop: Sample the action $\hat a^*$ (estimate of the true $a^*$) that has the best upper confidence bound, i.e., that achieves

$$\max_{a \in A} \left( \hat Q_t(x, a) + \sqrt{\frac{2 \ln \bar n}{N_{a,t}^x}} \right), \qquad (3.5)$$

where $N_{a,t}^x$ denotes the number of times action $a$ has been sampled so far, and $\hat Q_t$ is defined by

$$\hat Q_t(x, a) = R_t(x, a) + \alpha \frac{1}{N_{a,t}^x} \sum_{y \in \Lambda_a^x} \hat J_{t+1}^{N_{t+1}}(y), \qquad (3.6)$$

where $\Lambda_a^x$ is the set of next states sampled so far with respect to $P_{x,\cdot|t}(a)$, with $|\Lambda_a^x| = N_{a,t}^x$.

   -  Update $N_{\hat a^*,t}^x \leftarrow N_{\hat a^*,t}^x + 1$ and $\Lambda_{\hat a^*}^x \leftarrow \Lambda_{\hat a^*}^x \cup \{y'\}$, where $y'$ is the next state newly sampled by $\hat a^*$.

   -  Update $\hat Q_t(x, \hat a^*)$ with the $\hat J_{t+1}^{N_{t+1}}(y')$ value.

   -  $\bar n \leftarrow \bar n + 1$. If $\bar n = N_t$, then exit Loop.

•  Exit: Set $\hat J_t^{N_t}(x)$ such that

$$\hat J_t^{N_t}(x) = \begin{cases} \sum_{a \in A} \frac{N_{a,t}^x}{N_t} \hat Q_t(x, a) & \text{if } t = 0, \ldots, T-1, \\ 0 & \text{if } t = T, \end{cases} \qquad (3.7)$$

and return $\hat J_t^{N_t}(x)$.

Figure 3.1: Adaptive multi-stage sampling algorithm (AMS) description

Figure 3.2: The sequence of the recursive calls made in Initialization of the AMS algorithm. Each node corresponds to a state and each arrow with noted action signifies a sampling (and a recursive call). The bold-face number near each arrow is the sequence number for the recursive calls made. For simplicity, the entire Loop process is signified by one call number.

Figure 3.2 graphically illustrates the sequence of calls with two actions and $T = 3$ for the Initialization portion. We remark that this sampling strategy, as depicted in Figure 3.2, resembles the recursive decision tree that [49] use for planning algorithms, as well as the non-recursive simulated/sampling trees that [17] use for an American-style option pricing problem and that [33] use in a more general MDP setting. However, as mentioned before, all those works use non-adaptive sampling, in the sense that the number of samples for each action is pre-specified.
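The recursion in Figure 3.1 can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name `ams`, the callback signatures `reward(x, a, t)` and `sample_next(x, a, t, rng)`, and the list `N` of per-stage budgets are assumptions introduced here, and the loop is written so that exactly `N[t]` samples are drawn at each call.

```python
import math
import random


def ams(x, t, T, N, actions, reward, sample_next, alpha=1.0, rng=None):
    """Sketch of adaptive multi-stage sampling for a maximization problem.

    Returns an estimate of the optimal reward-to-go from state x at stage t.
    N[t] >= len(actions) is the sampling budget at stage t (assumed name).
    """
    rng = rng or random.Random(0)
    if t == T:
        return 0.0  # terminal stage: no reward-to-go

    counts = {a: 0 for a in actions}   # times each action was sampled
    sums = {a: 0.0 for a in actions}   # sum of next-stage estimates

    def q_hat(a):
        # Immediate reward plus discounted sample mean, cf. (3.6).
        return reward(x, a, t) + alpha * sums[a] / counts[a]

    def sample(a):
        y = sample_next(x, a, t, rng)  # simulate one transition
        sums[a] += ams(y, t + 1, T, N, actions, reward, sample_next, alpha, rng)
        counts[a] += 1

    for a in actions:                  # Initialization: each action once
        sample(a)
    n_bar = len(actions)

    while n_bar < N[t]:                # Loop: follow the best upper bound
        best = max(
            actions,
            key=lambda a: q_hat(a) + math.sqrt(2.0 * math.log(n_bar) / counts[a]),
        )
        sample(best)
        n_bar += 1

    # Exit: weighted sum of the Q-estimates, cf. (3.7).
    return sum(counts[a] / N[t] * q_hat(a) for a in actions)
```

Only a simulator of the transitions is needed; the transition probabilities themselves never appear, which is the point of the sampling approach. For a cost-minimization problem one would pass negated costs as rewards.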

Now let $M_t$ be the number of recursive calls made to compute $\hat J_t^{N_t}$ in the worst case. At stage $t$, AMS makes at most $M_t = |A| N_t M_{t+1}$ recursive calls (in Initialization and Loop). Thus, the worst case running time complexity of AMS is $M_0 = O((|A| \max_t N_t)^T)$. In contrast, backward induction has $O(T |A| |X|^2)$ running time complexity (see, e.g., [16]). Therefore, the main benefit of AMS is independence from the state space size, but this comes at the expense of exponential (versus linear, for backward induction) dependence on both the action space and the horizon length.
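The worst-case count can be made explicit by unrolling the recursion. The following is a short worked derivation, assuming that a call at the terminal stage returns immediately, so that $M_T$ is a constant (this boundary value is an assumption made here for illustration).

```latex
\begin{align*}
M_0 &\le |A| N_0 \, M_1
     \le |A| N_0 \cdot |A| N_1 \, M_2
     \le \cdots
     \le \Bigl( \prod_{t=0}^{T-1} |A| N_t \Bigr) M_T \\
    &\le \bigl( |A| \max_t N_t \bigr)^{T} M_T
     = O\bigl( (|A| \max_t N_t)^{T} \bigr).
\end{align*}
```

For instance, with $|A| = 2$, $N_t = 4$ for all $t$, and $T = 3$, the bound is $(2 \cdot 4)^3 = 512$ recursive calls, regardless of the number of states.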

3.3  Convergence  Analysis 

In this section, we study the convergence properties of the AMS algorithm. In particular, we show that the final estimate of the optimal value function produced by the algorithm is asymptotically unbiased, and the worst possible bias is uniformly bounded by a quantity that converges to zero at rate $O\left( \sum_{t=0}^{T-1} \frac{\ln N_t}{N_t} \right)$.

We first consider a special case of the AMS algorithm, a non-recursive one-stage sampling algorithm (OSA) that results from applying AMS to the one-stage approximation problem described earlier in Section 3.2.1. The algorithm is illustrated in Figure 3.3 with a stochastic value function $U$ defined over $X$: for every $x \in X$, $U(x)$ is a nonnegative bounded random variable with unknown distribution. $U(x)$ can be viewed as the outcome/observation of a black box corresponding to an input $x$, where as before, when $x$ is given to the black box, we assume that the observations at different time instances are independent of each other and are identically distributed according to the unknown distribution. Let

$$U_{\max} = \max_{x, a} \Bigl( R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) E[U(y)] \Bigr),$$

and assume for the moment that $U_{\max} \le 1$. Note that since we are considering the one-stage problem, we have dropped the dependencies on stage $t$ in both $R$ and $P$. However, we should keep in mind that this setting, as well as all subsequent results, holds for every stage $t = 0, \ldots, T-1$.

One-stage Sampling Algorithm (OSA)

•  Input: a state $x \in X$ and $n \ge |A|$.

•  Initialization: Sample each action $a \in A$ once at state $x$ and set

$$\hat Q(x, a) = R(x, a) + \alpha U(y),$$

where $y \sim P_{x,\cdot}(a)$ is the sampled next state. Set $\bar n = |A|$.

•  Loop: Sample an action $\hat a^*$ that achieves

$$\max_{a \in A} \left( \hat Q(x, a) + \sqrt{\frac{2 \ln \bar n}{T_a^x(\bar n)}} \right),$$

where $T_a^x(\bar n)$ is the number of times action $a$ has been sampled so far at state $x$, $\bar n$ is the overall number of samples done so far, and

$$\hat Q(x, a) = R(x, a) + \alpha \frac{1}{T_a^x(\bar n)} \sum_{y \in \Lambda_a^x(\bar n)} U(y),$$

where $\Lambda_a^x(\bar n)$ is the set of sampled next states so far with $|\Lambda_a^x(\bar n)| = T_a^x(\bar n)$.

   -  Update $T_{\hat a^*}^x(\bar n) \leftarrow T_{\hat a^*}^x(\bar n) + 1$ and $\Lambda_{\hat a^*}^x(\bar n) \leftarrow \Lambda_{\hat a^*}^x(\bar n) \cup \{y'\}$, where $y'$ is the next state newly sampled by $\hat a^*$.

   -  Update $\hat Q(x, \hat a^*)$ with $U(y')$.

   -  $\bar n \leftarrow \bar n + 1$. If $\bar n = n$, then exit Loop.

•  Exit: Set $\hat J^n(x)$ such that

$$\hat J^n(x) = \sum_{a \in A} \frac{T_a^x(n)}{n} \hat Q(x, a). \qquad (3.8)$$

Figure 3.3: One-stage sampling algorithm (OSA) description

We now interpret the OSA in the context of an $|A|$-armed bandit problem, where each action $a$ corresponds to a gambling machine. Successive plays of machine $a$ yield "bandit rewards" which are independent and identically distributed according to an unknown distribution $\delta_a$ with unknown expectation

$$Q(x, a) = R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) E[U(y)],$$

and are independent across machines or actions.

In OSA, $T_a^x(n)$ represents the number of times machine $a$ has been played (or action $a$ has been sampled) during the $n$ plays. Define the expected regret $\rho(n)$ of OSA after $n$ plays by

$$\rho(n) = J(x)\, n - \sum_{a=1}^{|A|} Q(x, a)\, E[T_a^x(n)], \quad \text{where } J(x) = \max_{a \in A} Q(x, a).$$

We  now  state  a  key  theorem,  which  will  be  the  basis  of  our  convergence  results  for 
the  OSA  algorithm,  whose  proof  is  given  in  [8] . 


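Theorem 3.3.1 For all $|A| > 1$, if OSA is run on $|A|$ machines having arbitrary bandit reward distributions $\delta_a$, $a \in A$, with $U_{\max} \le 1$, then

$$\rho(n) \le \sum_{a :\, Q(x,a) < J(x)} \left[ \frac{8 \ln n}{J(x) - Q(x, a)} + \Bigl( 1 + \frac{\pi^2}{3} \Bigr) \bigl( J(x) - Q(x, a) \bigr) \right],$$

where $Q(x, a)$ is the expected value of bandit rewards with respect to $\delta_a$.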
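To get a feel for the size of this bound, the sketch below evaluates a logarithmic regret bound of the form $\sum_{a: Q(x,a) < J(x)} \bigl[ 8 \ln n / (J(x) - Q(x,a)) + (1 + \pi^2/3)(J(x) - Q(x,a)) \bigr]$. The function name `ucb_regret_bound` is hypothetical, and the true expectations $Q(x,a)$ are treated as known inputs purely for illustration; in the setting of the text they are unknown.

```python
import math


def ucb_regret_bound(q_values, n):
    """Evaluate the logarithmic regret bound for given true Q-values.

    q_values: iterable of true expected bandit rewards Q(x, a), one per action.
    n: number of plays (n >= 1).
    """
    q_values = list(q_values)
    j = max(q_values)  # J(x) = max_a Q(x, a)
    bound = 0.0
    for q in q_values:
        gap = j - q
        if gap > 0:  # only non-optimal actions contribute
            bound += 8.0 * math.log(n) / gap + (1.0 + math.pi ** 2 / 3.0) * gap
    return bound
```

The bound grows like $\ln n$, and it grows as the gap between the best and second-best action shrinks, which reflects that nearly optimal actions are harder to rule out.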


The  convergence  of  the  OSA  algorithm  is  summarized  in  the  following  lemma. 


Lemma 3.3.1 Let the stochastic value function $U$ be as defined earlier with $U_{\max} \le 1$, and suppose the total number of samples allowed by OSA is $n$. Then we have, for all $x \in X$,

$$E[\hat J^n(x)] \to J(x) \quad \text{as } n \to \infty,$$

where $J(x)$ is the true optimal value function at state $x$ for the one-stage problem, i.e.,

$$J(x) = \max_{a \in A} \Bigl( R(x, a) + \alpha \sum_{y \in X} P_{x,y}(a) E[U(y)] \Bigr).$$

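Proof: Note that $\max_a (J(x) - Q(x, a)) \le U_{\max}$. We define the set of nonoptimal actions for $x$ as $\phi(x) = \{ a \mid Q(x, a) < J(x),\ a \in A \}$. Define $\beta(x)$ for $\phi(x) \ne \emptyset$ such that

$$\beta(x) = \min_{a \in \phi(x)} \bigl( J(x) - Q(x, a) \bigr), \qquad (3.9)$$

and note that $0 < \beta(x) \le U_{\max}$. Let

$$\tilde J(x) = \sum_{a \in A} \frac{T_a^x(n)}{n} Q(x, a).$$

By the definition of $\rho(n)$ and Theorem 3.3.1,

$$0 \le J(x) - E[\tilde J(x)] = \frac{\rho(n)}{n} \le \frac{C_1 \ln n}{n} + \frac{C_2}{n}, \qquad (3.10)$$

where $C_1$ and $C_2$ are some constants. Since $X$ is finite, there exists a constant $C > 0$ such that $0 < C_i \le C$, $i = 1, 2$, for all $x \in X$; note also that $\rho(n) = 0$ if $\phi(x) = \emptyset$. By the definition of $\hat J^n(x)$ (cf. (3.8)), it follows that

$$J(x) - E[\hat J^n(x)] = J(x) - E[\tilde J(x) - \tilde J(x) + \hat J^n(x)] = J(x) - E[\tilde J(x)] + E\left[ \sum_{a \in A} \frac{T_a^x(n)}{n} \bigl( Q(x, a) - \hat Q(x, a) \bigr) \right]. \qquad (3.11)$$

Clearly by (3.10), the first term $J(x) - E[\tilde J(x)]$ above is bounded by zero from below with convergence rate of $O(\ln n / n)$. We now show that the last term in (3.11) is zero.

Let $Y_j \sim P_{x,\cdot}(a)$ denote the (i.i.d.) $j$th next state sampled from the same starting state $x$ with the same action $a$. Then, $T_a^x(n)$ for every finite $n$ is a stopping time (cf., e.g., [64], p. 104) for $\{Y_j\}$, since $T_a^x(n) \le n < \infty$ and the event $\{T_a^x(n) = k\}$ is independent of $\{Y_{k+1}, \ldots\}$. It follows that

$$E\left[ \sum_{a \in A} \frac{T_a^x(n)}{n} \bigl( Q(x, a) - \hat Q(x, a) \bigr) \right] = \frac{\alpha}{n} \sum_{a \in A} \left( E[T_a^x(n)]\, E[U(Y_1)] - E\Bigl[ \sum_{j=1}^{T_a^x(n)} U(Y_j) \Bigr] \right) = 0$$

by applying Wald's equation.

Therefore, the convergence follows directly from (3.10) and (3.11). ∎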
We are now ready to state the main convergence theorem for the AMS algorithm, whose proof is based upon a straightforward inductive application of Lemma 3.3.1.


Theorem 3.3.2 Assume that the one-stage reward function is uniformly bounded by $1/T$, i.e., $R_{\max} = \max_{x,a,t} R_t(x, a) \le 1/T$. Suppose AMS is run with a given (arbitrary) initial state $x$ and input $N_t$, $t = 0, \ldots, T-1$. Then

(1)

$$\lim_{N_t \to \infty,\ \forall\, t = 0, \ldots, T-1} E[\hat J_0^{N_0}(x)] = J_0^*(x).$$

Moreover, the worst possible bias induced by the algorithm is bounded by a quantity that converges to zero at rate $O\left( \sum_{t=0}^{T-1} \frac{\ln N_t}{N_t} \right)$, i.e.,

(2)

$$J_0^*(x) - E[\hat J_0^{N_0}(x)] \le O\left( \sum_{t=0}^{T-1} \frac{\ln N_t}{N_t} \right), \quad x \in X.$$

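Proof: At stage $T-1$, by the definition of $\hat J_{T-1}^{N_{T-1}}(x)$ (cf. (3.7)) and since $\hat J_T^{N_T}(y) = 0$,

$$\hat J_{T-1}^{N_{T-1}}(x) = \sum_{a \in A} \frac{N_{a,T-1}^x}{N_{T-1}} \Bigl( R_{T-1}(x, a) + \alpha \frac{1}{N_{a,T-1}^x} \sum_{y \in \Lambda_a^x} \hat J_T^{N_T}(y) \Bigr) \le R_{\max}, \quad x \in X.$$

It follows that at stage $T-2$,

$$\hat J_{T-2}^{N_{T-2}}(x) = \sum_{a \in A} \frac{N_{a,T-2}^x}{N_{T-2}} \Bigl( R_{T-2}(x, a) + \alpha \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} \hat J_{T-1}^{N_{T-1}}(y) \Bigr) \le \sum_{a \in A} \frac{N_{a,T-2}^x}{N_{T-2}} \bigl( R_{\max} + \alpha R_{\max} \bigr) = R_{\max}(1 + \alpha), \quad x \in X.$$

And by induction, we have for all $x \in X$ and $t = 0, \ldots, T-1$,

$$\hat J_t^{N_t}(x) \le R_{\max} \sum_{i=0}^{T-t-1} \alpha^i \le R_{\max}(T - t) \le 1,$$

where the last inequality follows from the assumption $R_{\max} T \le 1$.

Therefore, from Lemma 3.3.1 with $U_{\max} = R_{\max}(T - t) \le 1$, we have for $t = 0, \ldots, T-1$, and for arbitrary $x \in X$,

$$E[\hat J_t^{N_t}(x)] \to \max_{a \in A} \Bigl( R_t(x, a) + \alpha \sum_{y \in X} P_{x,y|t}(a) E[\hat J_{t+1}^{N_{t+1}}(y)] \Bigr) \quad \text{as } N_t \to \infty.$$

But for arbitrary $x \in X$, because $\hat J_T^{N_T}(x) = J_T^*(x) = 0$, $x \in X$,

$$E[\hat J_{T-1}^{N_{T-1}}(x)] \to J_{T-1}^*(x) \quad \text{as } N_{T-1} \to \infty,$$

which in turn leads to $E[\hat J_{T-2}^{N_{T-2}}(x)] \to J_{T-2}^*(x)$ as $N_{T-2} \to \infty$ for arbitrary $x \in X$, and by an inductive argument, we have that

$$\lim_{N_t \to \infty,\ \forall\, t = 0, \ldots, T-1} E[\hat J_0^{N_0}(x)] = J_0^*(x), \quad x \in X,$$

which completes the first part of the proof.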

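To show the second part, we define the space of bounded real-valued measurable functions on $X$ by $B(X)$, and at stage $t$, we also define an operator $\mathcal{T}_t : B(X) \to B(X)$ as

$$\mathcal{T}_t(\Phi)(x) = \max_{a \in A} \Bigl\{ R_t(x, a) + \alpha \sum_{y \in X} P_{x,y|t}(a) \Phi(y) \Bigr\}, \quad \Phi \in B(X),\ x \in X,\ t = 0, \ldots, T-1. \qquad (3.12)$$

In the proof of Lemma 3.3.1 (see (3.11)), we showed that for $t = 0, \ldots, T-1$,

$$\mathcal{T}_t(E[\hat J_{t+1}^{N_{t+1}}])(x) - E[\hat J_t^{N_t}(x)] \le O\left( \frac{\ln N_t}{N_t} \right), \quad x \in X.$$

Therefore, we have

$$\mathcal{T}_0(E[\hat J_1^{N_1}])(x) - E[\hat J_0^{N_0}(x)] \le O\left( \frac{\ln N_0}{N_0} \right), \quad x \in X, \qquad (3.13)$$

and

$$E[\hat J_1^{N_1}(x)] \ge \mathcal{T}_1(E[\hat J_2^{N_2}])(x) - O\left( \frac{\ln N_1}{N_1} \right), \quad x \in X. \qquad (3.14)$$

Applying the $\mathcal{T}_0$-operator to both sides of (3.14) and using the monotonicity property of $\mathcal{T}_0$ (see, e.g., [13]), we have

$$\mathcal{T}_0(E[\hat J_1^{N_1}])(x) \ge \mathcal{T}_0(\mathcal{T}_1(E[\hat J_2^{N_2}]))(x) - O\left( \frac{\ln N_1}{N_1} \right), \quad x \in X. \qquad (3.15)$$

Therefore, combining (3.13) and (3.15) yields

$$\mathcal{T}_0(\mathcal{T}_1(E[\hat J_2^{N_2}]))(x) - E[\hat J_0^{N_0}(x)] \le O\left( \frac{\ln N_0}{N_0} + \frac{\ln N_1}{N_1} \right), \quad x \in X.$$

Repeating this argument yields

$$\mathcal{T}_0(\mathcal{T}_1(\cdots \mathcal{T}_{T-1}(E[\hat J_T^{N_T}])))(x) - E[\hat J_0^{N_0}(x)] \le O\left( \sum_{t=0}^{T-1} \frac{\ln N_t}{N_t} \right), \quad x \in X. \qquad (3.16)$$

Observe that $\mathcal{T}_0(\mathcal{T}_1(\cdots \mathcal{T}_{T-1}(E[\hat J_T^{N_T}])))(x) = J_0^*(x)$, $x \in X$. Rewriting Equation (3.16), we finally have

$$0 \le J_0^*(x) - E[\hat J_0^{N_0}(x)] \le O\left( \sum_{t=0}^{T-1} \frac{\ln N_t}{N_t} \right), \quad x \in X,$$

where the first inequality above follows because $J_0^*(x)$ is the true optimal reward-to-go. ∎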


Remark 3.3.1 Note that the assumption $R_{\max} \le 1/T$ can be relaxed by adding a scaling factor to the upper confidence bound in (3.5), i.e.,

$$\max_{a \in A} \left( \hat Q_t(x, a) + R_{\max} T \sqrt{\frac{2 \ln \bar n}{N_{a,t}^x}} \right).$$

It can be shown that the result in Theorem 3.3.1 (cf. [8]) now becomes

$$\rho(n) \le \sum_{a :\, Q(x,a) < J(x)} \left[ \frac{8 R_{\max}^2 T^2 \ln n}{J(x) - Q(x, a)} + \Bigl( 1 + \frac{\pi^2}{3} \Bigr) \bigl( J(x) - Q(x, a) \bigr) \right].$$

It is also easy to verify that all convergence results (including convergence rate) in this chapter still hold for this modification.
3.4  A  Numerical  Example


We now apply the AMS algorithm to a classical finite horizon inventory control problem with lost sales. In these problems, the inventory level is periodically reviewed, orders are placed and received, demand is realized, and the new inventory level for the period is calculated, on which costs are charged. The objective is to find the (in general non-stationary) policy that minimizes expected costs, which comprise holding, order, and penalty costs. Here, demand is assumed to be a discrete random variable.

Let $D_t$ denote the demand in period $t$, $x_t$ the inventory level at the end of period $t$ (which is the inventory at the beginning of period $t+1$), $a_t$ the order amount in period $t$, $p$ the per period per unit demand lost penalty cost, $h$ the per period per unit inventory holding cost, $K$ the fixed (set-up) cost per order, and $L$ the maximum inventory level (storage capacity), i.e., $x_t \in \{0, 1, \ldots, L\}$. Then the inventory level evolves as follows:

$$x_{t+1} = (x_t + a_t - D_t)^+.$$

The objective function is the expectation of the total cost given by

$$E\left[ \sum_{t=1}^{T} \bigl[ h (x_t + a_t - D_t)^+ + p (D_t - x_t - a_t)^+ + K \cdot I\{a_t > 0\} \bigr] \,\Big|\, x_0 \right],$$

where $x_0$ is the initial inventory level, $T$ is the number of periods (time horizon), and $I\{\cdot\}$ is the indicator function. Note that we are ignoring per-unit order costs for simplicity.
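The dynamics and the one-period cost can be written down directly. The sketch below is an illustration only: the function names `inventory_step` and `total_cost` are hypothetical, and the capacity constraint on feasible orders is deliberately left out.

```python
def inventory_step(x, a, d, h, p, K):
    """Simulate one period of the lost-sales inventory model.

    x: inventory before ordering, a: order amount, d: realized demand,
    h: unit holding cost, p: unit lost-sales penalty, K: fixed order cost.
    Returns (next_inventory, period_cost).
    """
    leftover = max(x + a - d, 0)   # (x + a - d)^+, carried to the next period
    lost = max(d - x - a, 0)       # (d - x - a)^+, unmet demand is lost
    cost = h * leftover + p * lost + (K if a > 0 else 0)
    return leftover, cost


def total_cost(x0, orders, demands, h, p, K):
    """Sum the period costs along one realized demand path."""
    x, total = x0, 0
    for a, d in zip(orders, demands):
        x, c = inventory_step(x, a, d, h, p, K)
        total += c
    return total
```

Averaging `total_cost` over many sampled demand paths for a fixed ordering rule gives a Monte Carlo estimate of the objective above for that rule.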

We consider two versions: (i) fixed order amount $q$; (ii) any (integral) order amount (up to capacity). In both cases, if the order amount would bring the inventory level above the inventory capacity $L$, then that order cannot be placed, i.e., that order amount action is not feasible in that state. In case (i), there are just two actions (order or no order), whereas in case (ii), the number of actions depends on the capacity limit.

Central to the context of the algorithm is that the underlying distribution is unknown, and that only samples are available. Furthermore, there is no structural knowledge on the form of the optimal policy. However, the example selected here was chosen to be simple, so that the optimal solution can be computed easily by standard techniques once the distribution is given and the performance of the algorithm can be evaluated.

In actual implementation, a slight modification is required for this example, because it is a minimization problem, whereas AMS was written for a maximization problem. Conceptually, the most straightforward way is just to take the reward as the negative of the cost function. Equivalently, we change (3.5) in AMS by replacing the "max" operator with the "min" operator and the addition with subtraction, i.e.,

$$\min_{a \in A} \left( \hat Q_t(x, a) - \sqrt{\frac{2 \ln \bar n}{N_{a,t}^x}} \right).$$

With $K = 0$ (no fixed order cost), the optimal order policy is easily solvable without dynamic programming, because the periods are decoupled, and the problem reduces to solving a single-period inventory optimization problem. In case (i), the optimal policy follows a threshold rule, in which an order is placed if the inventory is below a certain level; otherwise, no order is placed. The threshold (order point) is given by

$$s = \min_{x \ge 0} \bigl\{ x : h E[(x + q - D)^+] + p E[(D - q - x)^+] \ge h E[(x - D)^+] + p E[(D - x)^+] \bigr\},$$

i.e., one orders in period $t$ if $x_t < s$ (assuming that $x_t + q \le L$; also, if the set is empty, then take $s = \infty$, i.e., an order will always be placed). In case (ii), the problem becomes a newsboy problem, with a base-stock (order up to) solution given by

$$S = F^{-1}(p / (p + h)),$$

where $F$ is the cumulative distribution function of the demand, i.e., one orders $(S - x_t)^+$ in period $t$ (with the implicit assumption that $S \le L$).
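Both quantities are easy to evaluate once the demand distribution is given. The sketch below does so for a discrete uniform demand on $\{0, \ldots, 9\}$, the distribution used in the experiments; the function names are hypothetical, exact rational arithmetic is used to avoid rounding at ties, and $F^{-1}$ is taken to be the generalized inverse (smallest $k$ with $F(k) \ge p/(p+h)$), which is an assumption about the intended convention.

```python
from fractions import Fraction

DEMANDS = range(10)  # support of D ~ DU(0, 9)


def expect(f):
    """Exact expectation of f(D) under the discrete uniform demand."""
    return sum(Fraction(f(d)) for d in DEMANDS) / len(DEMANDS)


def one_period_cost(y, h, p):
    """Expected holding plus penalty cost with y units on hand before demand."""
    return expect(lambda d: h * max(y - d, 0) + p * max(d - y, 0))


def order_point(q, h, p, L):
    """Smallest x for which ordering q is no cheaper than not ordering.

    Returns None when no such x exists in {0, ..., L} (order always).
    """
    for x in range(L + 1):
        if one_period_cost(x + q, h, p) >= one_period_cost(x, h, p):
            return x
    return None


def base_stock(h, p):
    """Smallest k with F(k) >= p / (p + h)."""
    ratio = Fraction(p, p + h)
    for k in DEMANDS:
        if Fraction(k + 1, len(DEMANDS)) >= ratio:
            return k
```

With $q = 10$ and $h = 1$, `order_point` gives 0 for $p = 1$ and 6 for $p = 10$, which agrees with the order points listed for $K = 0$ in the case (i) results.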
For the $K > 0$ case (i), the optimal policy is again a threshold (order point) policy, but the order point is nonstationary, whereas in case (ii), the optimal policy is of the $(s, S)$ type, again nonstationary. To obtain the true solutions, standard backward induction was employed, using knowledge of the underlying demand distribution.

For the numerical experiments, we used the following parameter settings: horizon $T = 3$; capacity $L = 20$; initial inventory $x_0 = 5$; demand $D_t \sim DU(0, 9)$ (discrete uniform); holding cost $h = 1$; penalty cost $p = 1$ and $p = 10$; fixed order cost $K = 0$ and $K = 5$; fixed order amount for case (i): $q = 10$. Note that since the order quantity is greater than the maximum demand for our values of the parameters, i.e., $q > D_t$ always, placing an order guarantees no lost sales.

3.4.1  Two  Alternative  Estimators 

Preliminary  experiments  with  the  algorithm  indicated  relatively  slow  convergence, 
so  we  decided  to  consider  alternative  estimators  to  improve  the  empirical  performance. 
But  first,  we  present  a  theorem,  which  will  be  useful  in  studying  these  estimators. 

Theorem 3.4.1 Let $\{X_i,\ i = 1, 2, \ldots\}$ be a sequence of i.i.d. random variables with $0 \le X_i \le D$ and $E[X_i] = \mu$ $\forall\, i$, and let $M$ be a bounded integer-valued random variable, with $0 \le M \le K$ for some positive integer $K$. If the event $\{M = n\}$ is independent of $\{X_{n+1}, X_{n+2}, \ldots\}$, then for any given $\epsilon > 0$ and $n \in Z^+$, we have

$$P\left( \Bigl| \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \Bigr| \ge \epsilon,\ M \ge n \right) \le 2 e^{-n(\tau \epsilon - \Lambda_D(\tau) D^2)} \quad \forall\, \tau \in (0, \tau_{\max}), \qquad (3.17)$$

where $\tau_{\max}$ satisfies $\tau_{\max} \ne 0$ and $1 + (D + \epsilon) \tau_{\max} - e^{D \tau_{\max}} = 0$ (see Figure 3.4), and

$$\Lambda_D(\tau) := \frac{e^{\tau D} - 1 - \tau D}{D^2}.$$

Figure 3.4: A sketch of the function $f_1(\tau) = e^{D\tau}$ and the function $f_2(\tau) = 1 + \tau(D + \epsilon)$.

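Proof: Let $Y_k = \sum_{i=1}^{k} (X_i - \mu)$. It is easy to see that the sequence $\{Y_k\}$ forms a martingale. Therefore, for any $\tau > 0$,

$$P\left( \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \ge \epsilon,\ M \ge n \right) = P(Y_M \ge M\epsilon,\ M \ge n) = P\bigl( \tau Y_M - \Lambda_D(\tau) \langle Y \rangle_M \ge \tau M \epsilon - \Lambda_D(\tau) \langle Y \rangle_M,\ M \ge n \bigr),$$

where

$$\langle Y \rangle_n = \sum_{j=1}^{n} E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}], \quad \Delta Y_j = Y_j - Y_{j-1},$$

and $\mathcal{F}_j$ is the $\sigma$-field generated by $\{Y_1, \ldots, Y_j\}$.

Now for any $\tau \in (0, \tau_{\max})$ and for any $n_1 \ge n_0$, where $n_0, n_1 \in Z^+$, and $Z^+$ is the set of all positive integers, we have

$$\tau (n_1 - n_0) \epsilon \ge \frac{e^{\tau D} - 1 - \tau D}{D^2} (n_1 - n_0) D^2 \ge \Lambda_D(\tau) \Bigl[ \sum_{j=1}^{n_1} E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}] - \sum_{j=1}^{n_0} E[(\Delta Y_j)^2 \mid \mathcal{F}_{j-1}] \Bigr],$$

which implies that

$$\tau n_1 \epsilon - \Lambda_D(\tau) \langle Y \rangle_{n_1} \ge \tau n_0 \epsilon - \Lambda_D(\tau) \langle Y \rangle_{n_0}, \quad \forall\, \tau \in (0, \tau_{\max}).$$

Thus for all $\tau \in (0, \tau_{\max})$,

$$P\left( \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \ge \epsilon,\ M \ge n \right) \le P\bigl( \tau Y_M - \Lambda_D(\tau) \langle Y \rangle_M \ge \tau n \epsilon - \Lambda_D(\tau) \langle Y \rangle_n,\ M \ge n \bigr)$$
$$\le P\bigl( \tau Y_M - \Lambda_D(\tau) \langle Y \rangle_M \ge \tau n \epsilon - \Lambda_D(\tau) n D^2,\ M \ge n \bigr)$$
$$= P\bigl( e^{\tau Y_M - \Lambda_D(\tau) \langle Y \rangle_M} \ge e^{\tau n \epsilon - n \Lambda_D(\tau) D^2},\ M \ge n \bigr). \qquad (3.18)$$

It can be shown (cf., e.g., Lemma 1 in [77], p. 505) that the sequence $\{Z_t(\tau) = e^{\tau Y_t - \Lambda_D(\tau) \langle Y \rangle_t},\ t \ge 1\}$ with $Z_0(\tau) = 1$ forms a non-negative supermartingale. It follows that

$$(3.18) \le P\bigl( e^{\tau Y_M - \Lambda_D(\tau) \langle Y \rangle_M} \ge e^{\tau n \epsilon - n \Lambda_D(\tau) D^2} \bigr)$$
$$\le P\Bigl( \sup_{0 \le t \le K} Z_t(\tau) \ge e^{\tau n \epsilon - n \Lambda_D(\tau) D^2} \Bigr)$$
$$\le \frac{E[Z_0(\tau)]}{e^{\tau n \epsilon - n \Lambda_D(\tau) D^2}} \quad \text{by the maximal inequality for supermartingales (cf. [77])}$$
$$= e^{-n(\tau \epsilon - \Lambda_D(\tau) D^2)}.$$

By using a similar argument, we can also show that

$$P\left( \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \le -\epsilon,\ M \ge n \right) \le e^{-n(\tau \epsilon - \Lambda_D(\tau) D^2)}.$$

∎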


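Now we optimize the right-hand side of (3.17) over $\tau$. It is easy to verify that the optimal $\tau^*$ is given by $\tau^* = \frac{1}{D} \ln \frac{D + \epsilon}{D} \in (0, \tau_{\max})$. It follows that

$$\tau^* \epsilon - \Lambda_D(\tau^*) D^2 = \frac{D + \epsilon}{D} \ln \frac{D + \epsilon}{D} - \frac{\epsilon}{D}.$$

Therefore, we have

$$P\left( \Bigl| \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \Bigr| \ge \epsilon,\ M \ge n \right) \le 2 e^{-n \frac{\epsilon^2}{2 D^2}}, \quad \text{if } \epsilon \ll D.$$

When $M$ is deterministic, this result is very similar to the well-known Hoeffding's inequality [40]. Note that by (3.17), we also have

$$\sum_{n=0}^{\infty} P\left( \Bigl| \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \Bigr| \ge \epsilon,\ M \ge n \right) \le \frac{2}{1 - e^{-(\tau \epsilon - \Lambda_D(\tau) D^2)}} < \infty \quad \forall\, \tau \in (0, \tau_{\max}),$$

which in turn, by the Borel-Cantelli lemma, implies that the event $\bigl\{ \bigl| \frac{1}{M} \sum_{i=1}^{M} X_i - \mu \bigr| \ge \epsilon,\ M \ge n \bigr\}$ will only happen finitely often w.p.1 as $n \to \infty$. And since $\epsilon$ is arbitrary, we have

$$\frac{1}{M} \sum_{i=1}^{M} X_i \to \mu \quad \text{w.p.1 as } M \to \infty. \qquad (3.19)$$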

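Now consider an estimator that chooses the action that is sampled the most in order to estimate the value function, i.e., for $t < T$,

$$\hat J_t^{N_t}(x) = \hat Q_t(x, \hat a_t^*), \quad \text{where } \hat a_t^* = \arg\max_a \{ N_{a,t}^x \}, \qquad (3.20)$$

$$\hat Q_t(x, a) = R_t(x, a) + \alpha \frac{1}{N_{a,t}^x} \sum_{y \in \Lambda_a^x} \hat J_{t+1}^{N_{t+1}}(y).$$

Now we show that if the one-stage reward function $R(x, a)$ is deterministic, this estimator underestimates the optimal value function w.p.1 in an asymptotic sense. At the final stage $T$, we clearly have $\hat J_T^{N_T}(x) = J_T^*(x) = 0$, $\forall\, x$. Thus, at stage $t = T-1$, we have

$$\hat J_{T-1}^{N_{T-1}}(x) = R_{T-1}(x, \hat a_{T-1}^*) \le \max_a R_{T-1}(x, a) = J_{T-1}^*(x), \quad \forall\, x.$$

It follows that

$$\hat J_{T-2}^{N_{T-2}}(x) = R_{T-2}(x, \hat a_{T-2}^*) + \alpha \frac{1}{N_{\hat a_{T-2}^*, T-2}^x} \sum_{y \in \Lambda_{\hat a_{T-2}^*}^x} \hat J_{T-1}^{N_{T-1}}(y)$$
$$\le R_{T-2}(x, \hat a_{T-2}^*) + \alpha \frac{1}{N_{\hat a_{T-2}^*, T-2}^x} \sum_{y \in \Lambda_{\hat a_{T-2}^*}^x} J_{T-1}^*(y)$$
$$\le R_{T-2}(x, \hat a_{T-2}^*) + \alpha E[J_{T-1}^*(Y) \mid x, \hat a_{T-2}^*] + \alpha \Delta_{T-2}$$
$$\le J_{T-2}^*(x) + \alpha \Delta_{T-2},$$

where $Y$ is a random variable distributed according to $P_{x,\cdot|T-2}(\hat a_{T-2}^*)$, and

$$\Delta_{T-2} := \max_{x, a} \Bigl\{ \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} J_{T-1}^*(y) - E[J_{T-1}^*(Y) \mid x, a] \Bigr\}.$$

Thus, by an inductive argument, it is easy to see that

$$\hat J_0^{N_0}(x) \le J_0^*(x) + \sum_{i=0}^{T-2} \alpha^{i+1} \Delta_i. \qquad (3.21)$$

Thus, by taking the limit on both sides of (3.21) and using (3.19),

$$\lim_{N_t \to \infty\ \forall t} \hat J_0^{N_0}(x) \le \lim_{N_t \to \infty\ \forall t} \Bigl\{ J_0^*(x) + \sum_{i=0}^{T-2} \alpha^{i+1} \Delta_i \Bigr\} = J_0^*(x) \quad \text{w.p.1},$$

since $N_{a,t}^x \to \infty$ as $N_t \to \infty$ $\forall\, a$.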

We combine it with the original estimator to obtain the following estimator:

$$\hat J_t^{N_t}(x) = \max \Bigl\{ \hat Q_t(x, \hat a_t^*),\ \sum_{a \in A} \frac{N_{a,t}^x}{N_t} \hat Q_t(x, a) \Bigr\}. \qquad (3.22)$$

Intuitively, the reason behind combining via the max operator is that the estimator would be choosing the best between the two possible estimators of the Q-function, so the new estimator will at least share the same convergence rate as the original estimator.

A second alternative estimator replaces the weighted sum of the Q-value estimates in (3.7) by the maximum of the estimates, i.e., for $t < T$,

$$\hat J_t^{N_t}(x) = \max_{a \in A} \hat Q_t(x, a). \qquad (3.23)$$

For the non-adaptive case, it can be shown that this estimator is also asymptotically unbiased, with an upward finite-sample bias for maximization problems and a downward finite-sample bias for minimization problems such as the inventory control problem. In fact, by using a similar argument as above, we can establish the probability one convergence of this estimator.
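The three estimators differ only in how they combine the per-action Q-estimates at the Exit step. The sketch below computes all three for a maximization problem from given Q-estimates and sampling counts; the function name `estimators` and the dictionary inputs are hypothetical conveniences, not notation from the text.

```python
def estimators(q_hat, counts):
    """Return (estimator 1, estimator 2, estimator 3) for a maximization problem.

    q_hat[a]: current estimate of the Q-value of action a.
    counts[a]: number of times action a has been sampled.
    """
    n_total = sum(counts.values())
    # Estimator 1: weighted sum of Q-estimates, cf. (3.7).
    weighted = sum(counts[a] / n_total * q_hat[a] for a in q_hat)
    # Action sampled most often, cf. (3.20).
    most_sampled = max(counts, key=counts.get)
    # Estimator 2: best of the most-sampled action and the weighted sum, cf. (3.22).
    combined = max(q_hat[most_sampled], weighted)
    # Estimator 3: maximum of the Q-estimates, cf. (3.23).
    greedy = max(q_hat.values())
    return weighted, combined, greedy
```

For a minimization problem such as the inventory example, the `max` operations would be replaced by `min`.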
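At the final stage $t = T$, $\hat J_T^{N_T}(x) = J_T^*(x) = 0$, $\forall\, x$. Thus, when $t = T-1$, $\hat J_{T-1}^{N_{T-1}}(x) = \max_a R_{T-1}(x, a) = J_{T-1}^*(x)$, $\forall\, x$. When $t = T-2$, we have

$$\hat J_{T-2}^{N_{T-2}}(x) = \max_a \Bigl\{ R_{T-2}(x, a) + \alpha \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} \hat J_{T-1}^{N_{T-1}}(y) \Bigr\}$$
$$= \max_a \Bigl\{ R_{T-2}(x, a) + \alpha \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} J_{T-1}^*(y) \Bigr\}$$
$$= \max_a \Bigl\{ R_{T-2}(x, a) + \alpha E[J_{T-1}^*(Y) \mid x, a] + \alpha \Bigl( \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} J_{T-1}^*(y) - E[J_{T-1}^*(Y) \mid x, a] \Bigr) \Bigr\}$$
$$\le \max_a \bigl\{ R_{T-2}(x, a) + \alpha E[J_{T-1}^*(Y) \mid x, a] \bigr\} + \alpha \Delta_{T-2}^+$$
$$= J_{T-2}^*(x) + \alpha \Delta_{T-2}^+ \quad \forall\, x,$$

where $\Delta_{T-2}^+ := \max_{x,a} \bigl\{ \frac{1}{N_{a,T-2}^x} \sum_{y \in \Lambda_a^x} J_{T-1}^*(y) - E[J_{T-1}^*(Y) \mid x, a] \bigr\}$. Thus, by the monotonicity of the dynamic programming algorithm, we have

$$\hat J_0^{N_0}(x) \le J_0^*(x) + \sum_{t=0}^{T-2} \alpha^{t+1} \Delta_t^+.$$

A similar argument can also be used to show that

$$J_0^*(x) - \sum_{t=0}^{T-2} \alpha^{t+1} \Delta_t^- \le \hat J_0^{N_0}(x) \le J_0^*(x) + \sum_{t=0}^{T-2} \alpha^{t+1} \Delta_t^+, \qquad (3.24)$$

where $\Delta_t^- := \max_{x,a} \bigl\{ E[J_{t+1}^*(Y) \mid x, a] - \frac{1}{N_{a,t}^x} \sum_{y \in \Lambda_a^x} J_{t+1}^*(y) \bigr\}$. Hence, since both the state and action spaces are finite, the probability one convergence of the estimator follows by taking the limit on both sides of (3.24) and then using (3.19).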


3.4.2  Numerical  Results 

Figures 3.5, 3.6, 3.7, and 3.8 show the convergence of the estimates as a function of the number of samples at each stage for each of the respective cases (i) and (ii) considered. In each figure, estimator 1 stands for the original estimator using (3.7), and estimators 2 and 3 refer to the estimators using $\hat J_t^{N_t}(x)$ from (3.22) and $\hat J_t^{N_t}(x)$ from (3.23), respectively. Tables 3.1 and 3.2 give the performances of these estimators for each of the respective cases (i) and (ii), including the optimal value and policy parameters. The results indicate the convergence of all three estimators. The two alternative estimators show better empirical performance than the original estimator. We believe that this is because the original estimator uses the weighted sum of the Q-function estimates, which could be too conservative for test cases where greedy estimators may perform better.

3.5  Concluding  Remarks 

The AMS algorithm targets MDPs with relatively large state spaces; however, for problems where a relatively small set of states is likely to be revisited, it might be advantageous to store calculated values of $\hat J_t^{N_t}(x)$ to avoid recomputing them. This could result in substantial savings for longer-horizon problems, since it would also avoid the costly recursive calls. The trade-off in additional required storage, possibly unmanageable for very large state spaces, would have to be evaluated against the estimated resultant gains in running time.

We  can  extend  the  AMS  algorithm  to  include  the  case  where  the  reward  function 
is  random.  The  AMS  algorithm  would  essentially  remain  identical,  except  that  sampling 
would  now  include  both  the  next  state  and  the  one-stage  reward.  However,  the  convergence 
proof  is  likely  to  require  more  technical  manipulations.  Furthermore,  the  assumption 
of  bounded  rewards  can  be  relaxed  by  using  the  result  in  [1].  Even  though  the  AMS 
algorithm  will  converge  too  in  this  case,  unfortunately,  we  lose  the  property  of  the  uniform 
logarithmic  bound  so  that  the  convergence  rate  is  expected  to  be  very  slow. 

Earlier work of [18] proposed several algorithms that achieve regret bounds of the form $c_1 + c_2 \log n + c_3 \log^2 n$, where $n$ is the total number of plays and the $c_i$'s are positive constants not depending on $n$. These algorithms might also be used to create adaptive sampling algorithms for solving MDPs. However, those algorithms have the drawback that we need to know the exact value of $\beta(x)$ (cf. (3.9)) for a given state $x$ under the assumption that not all of the actions are optimal, which is difficult to obtain in advance. This holds also for other algorithms studied in [8].

Figure 3.5: Convergence of value function estimate for the inventory control example case (i) $q = 10$ as a function of the number of samples at each state: $T = 3$, $L = 20$, $x_0 = 5$, $D_t \sim DU(0, 9)$, $h = 1$, $K = 0$. Panels: $p = 1$ and $p = 10$.

Figure 3.6: Convergence of value function estimate for the inventory control example case (i) $q = 10$ as a function of the number of samples at each state: $T = 3$, $L = 20$, $x_0 = 5$, $D_t \sim DU(0, 9)$, $h = 1$, $K = 5$. Panels: $p = 1$ and $p = 10$.
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Figure  3.7:  Convergence  of  value  function  estimate  for  the  inventory  control  example  case 
(ii)  as  a  function  of  the  number  of  samples  at  each  state: 


T  =  3,  M  =  20,  xo  =  5,  Dt  ~  DU(0, 9),h  =  l,K  =  0. 


p=10 


Figure  3.8:  Convergence  of  value  function  estimate  for  the  inventory  control  example  case 


(ii)  as  a  function  of  the  number  of  samples  at  each  state: 


T  =  3,  M  =  20,  xo  =  5,  Dt  ~  DU(0, 9),h  =  l,K  =  5. 


(K, p)           optimal       N    est. 1 (std err)   est. 2 (std err)   est. 3 (std err)

K = 0, p = 1     10.440        4    15.030 (0.292)      9.563 (0.322)      9.134 (0.207)
                 s = 0         8    12.819 (0.156)     10.297 (0.096)     10.208 (0.102)
                              16    11.747 (0.093)     10.376 (0.079)     10.326 (0.081)
                              32    11.227 (0.062)     10.485 (0.057)     10.450 (0.057)

K = 0, p = 10    24.745        4    30.446 (0.868)     20.481 (0.817)     19.978 (0.793)
                 s = 6         8    28.843 (0.491)     23.679 (0.515)     23.091 (0.554)
                              16    26.691 (0.382)     23.937 (0.450)     23.882 (0.437)
                              32    26.118 (0.141)     24.734 (0.184)     24.728 (0.185)

K = 5, p = 1     10.490        4    18.451 (0.290)     10.413 (0.223)     10.227 (0.211)
                 s_1 = 0       8    14.449 (0.154)     10.619 (0.097)     10.589 (0.095)
                 s_2 = 0      16    12.480 (0.102)     10.516 (0.096)     10.509 (0.095)
                 s_3 = 0      32    11.473 (0.065)     10.458 (0.064)     10.458 (0.064)

K = 5, p = 10    31.635        4    37.523 (0.980)     26.917 (0.894)     26.418 (0.883)
                 s_1 = 6       8    36.172 (0.430)     30.406 (0.508)     30.132 (0.487)
                 s_2 = 6      16    33.812 (0.399)     30.802 (0.432)     30.763 (0.431)
                 s_3 = 5      32    33.113 (0.159)     31.641 (0.219)     31.617 (0.219)

Table 3.1: Value function estimate for the inventory control example case (i) as a function of the number of samples at each state: $T = 3$, $M = 20$, $x_0 = 5$, $D_t \sim DU(0, 9)$, $q = 10$, $h = 1$. Each entry represents the mean based on 30 independent replications (standard error in parentheses).




(K, p)           optimal               N    est. 1 (std err)   est. 2 (std err)   est. 3 (std err)

K = 0, p = 1     7.500                21    24.057 (0.160)      9.793 (0.209)      3.123 (0.170)
                 S = 4                25    22.050 (0.124)      6.281 (0.187)      5.063 (0.124)
                                      30    20.355 (0.114)      6.473 (0.093)      5.910 (0.089)
                                      35    18.823 (0.111)      6.618 (0.110)      6.263 (0.097)

K = 0, p = 10    13.500               21    29.171 (0.210)     13.686 (0.463)      6.035 (0.301)
                 S = 9                25    28.077 (0.208)     12.058 (0.293)      9.276 (0.230)
                                      30    27.304 (0.191)     13.277 (0.234)     11.399 (0.201)
                                      35    26.058 (0.164)     13.072 (0.157)     12.232 (0.176)

K = 5, p = 1     10.490               21    33.047 (0.124)     18.624 (0.437)      8.727 (0.209)
                 s_1 = 0, S_1 = 0     25    29.994 (0.095)     11.786 (0.158)     10.957 (0.109)
                 s_2 = 0, S_2 = 0     30    27.448 (0.099)     11.516 (0.066)     11.219 (0.052)
                 s_3 = 0, S_3 = 0     35    25.326 (0.090)     11.117 (0.068)     10.957 (0.056)

K = 5, p = 10    25.785               21    39.971 (0.217)     26.760 (0.522)     17.782 (0.492)
                 s_1 = 6, S_1 = 9     25    39.008 (0.191)     25.090 (0.334)     22.677 (0.263)
                 s_2 = 6, S_2 = 9     30    38.029 (0.163)     25.453 (0.273)     24.345 (0.174)
                 s_3 = 6, S_3 = 9     35    36.891 (0.116)     25.514 (0.276)     24.707 (0.230)

Table 3.2: Value function estimate for the inventory control example case (ii) as a function of the number of samples at each state: $T = 3$, $M = 20$, $x_0 = 5$, $D_t \sim DU(0, 9)$, $h = 1$. Each entry represents the mean based on 30 independent replications (standard error in parentheses).

Chapter 4

An Evolutionary Random Policy Search Algorithm for Solving Infinite Horizon Markov Decision Processes with Discounted Cost

As we can see from Chapter 1.1, many current solution methods for MDP problems have concentrated on reducing the size of the state space in order to address the well-known “curse of dimensionality”. However, these approaches generally require the ability to enumerate the entire action space; thus they may still be practically inefficient for problems with large action spaces. In fact, MDPs with large or uncountable action spaces are subject to inherent computational intractability (cf., e.g., [71]). The reason is that the general nonlinear programming problem can be viewed as a special case of the MDP problem, so solving general MDPs must be at least as hard as solving general (static) multivariate nonlinear programming problems. This has motivated our research to investigate the use of different global optimization strategies to improve the performance of current MDP solution techniques.

In this chapter, we propose an algorithm called Evolutionary Random Policy Search (ERPS) for solving infinite horizon discounted cost MDPs. The algorithm is meant to complement the highly successful state space reduction techniques introduced in Chapter 1.1. As a starting point, we will focus on MDPs where the state space is relatively small but the action space is very large, so that enumerating the entire action space becomes practically inefficient. For example, consider the problem of controlling the service rate of a single-server queue with a finite buffer size, say $L$, in order to minimize the average number of jobs in queue and the service cost. The state space of this problem is the possible number of jobs in the queue $\{0, 1, \ldots, L\}$, so the size of the state space is $L + 1$, whereas the possible actions might be all values in a given interval representing a service rate, in which case the action space is uncountable. From a more general point of view, if one of the aforementioned state space reduction techniques is considered (cf. Chapter 1.1), say state aggregation, then MDPs with small state spaces and large action spaces can also be regarded as the outcomes resulting from the aggregation of MDPs with large state and action spaces.

Unlike the action elimination techniques ([57], [30], cf. also Chapter 1.1), ERPS approaches the issue of large action spaces in an entirely different manner: it uses an evolutionary, population-based approach that explicitly specifies a set of good policies, and then iterates on this set to produce improving policies. The key idea is to avoid enumerating the entire action space by concentrating the search on a restricted action set at each iteration and carrying out the optimization task over the restricted set. For a given problem, ERPS proceeds iteratively by constructing and solving a sequence of sub-MDP problems, i.e., MDPs defined on smaller policy spaces. At each iteration of the algorithm, the sub-MDP constructed in the previous iteration is approximately solved by using a variant of the standard policy improvement technique, and a policy called an elite policy is generated. A group of policies is then generated based on the elite policy by using the “nearest neighbor” heuristic and random sampling of the entire action space, from which a new sub-MDP is created by restricting the original MDP problem (e.g., cost structure, transition probabilities) to the currently available subsets of actions. The above steps are performed repeatedly until a specified stopping rule is satisfied. The algorithm has the property that an elite policy generated at a later generation is guaranteed to perform at least as well as (in terms of value function) the elite policy at the current generation. We show that as the number of iterations goes to infinity, the sequence of elite policies will converge with probability one to an optimal policy.

Perhaps the most straightforward and most commonly used numerical approach for dealing with MDPs with uncountable action spaces is discretization (cf. the discussions in [72]). In practice, this could lead to computational difficulties, resulting either in an action space that is too large or in a solution that is not accurate enough. In contrast, our approach works directly on the action space, requiring no explicit discretization, and the adaptive version of the algorithm we propose improves the efficiency of the search process and produces high quality solutions. As in standard approaches such as PI and VI, the computational complexity of each iteration of ERPS is polynomial in the size of the state space, but unlike these procedures, it is insensitive to the size of the action space, making the algorithm a promising candidate for problems with relatively small state spaces but uncountable action spaces.

4.1  Related  Work 

There are a few works in the literature that apply evolutionary search methods such as genetic algorithms (GAs) to solving MDPs. Wells et al. [86] have experimented with different GA parameters (e.g., cross-over and mutation rates) for finding good limited finite memory policies for partially observable MDPs, and have discussed the effects of different GA parameters based on the empirical performance of their approach on a maze problem. Lin et al. [55] also use a GA approach to solve finite horizon partially observable MDPs; however, in their approach, the GA is used to construct approximations of the minimal set of affine functions that describes the value function, leading to a variant of value iteration. Barash [10] interprets infinite horizon discounted cost MDPs as optimization problems over the policy spaces and proposes a genetic search approach that directly searches the policy space to find good stationary policies. He concludes, by comparing the performance of his approach with that of the standard PI, that it is unlikely that policy search based on GAs can offer a competitive approach in cases where PI is implementable. More recently, Chang et al. [19] propose an algorithm called evolutionary policy iteration (EPI) to find good stationary policies for infinite horizon discounted cost MDPs with discrete state and action spaces. Their approach combines the standard procedures of GAs with the properties of infinite horizon MDPs, so that a certain monotonicity property is preserved among the populations of policies generated at successive iterations of the algorithm. Although their algorithm is guaranteed to converge with probability one, no performance comparisons with existing techniques are provided, and the theoretical convergence requires the action space to be finite.

ERPS shares some similarities with the EPI algorithm introduced in [19], where a sequence of “elite” policies is also produced at successive iterations of the algorithm. However, there are fundamental differences. In EPI, policies are treated as the most essential elements in optimization, and each “elite” policy is directly generated from a group of policies, whereas in our approach, policies are regarded as intermediate constructions from which sub-MDP problems are then constructed and solved. Moreover, EPI follows the general framework of GAs, and thus operates only at the global level, which usually results in slow convergence. In contrast, ERPS combines global search with a local enhancement step (the “nearest neighbor” heuristic) that leads to rapid convergence once a policy is found in a small neighborhood of an optimal policy. We argue that our approach substantially improves the performance of the EPI algorithm while keeping the computational complexity at roughly the same level.

4.2 Problem Setting

We consider the infinite horizon ($T = \infty$) MDP problem (2.1) described in Chapter 2.1 with finite state space $X$, a general (Borel) action space $A$, and discounted cost criterion

\[ J^*(x) = \inf_{\pi \in \Pi} J^{\pi}(x), \quad \text{and} \quad J^{\pi}(x) = E\Big[ \sum_{t=0}^{\infty} \alpha^t R(x_t, \pi(x_t)) \,\Big|\, x_0 = x \Big], \quad \alpha \in (0, 1), \qquad (4.1) \]

where throughout this chapter, unless otherwise specified, we denote the set of all stationary deterministic policies $\pi : X \to A$ by $\Pi$. Assume that there exists a stationary policy $\pi^* \in \Pi$ that achieves the optimal value $J^*(x)$ for all initial states $x \in X$; our objective is to find such a policy. Hereafter in this chapter, we denote the size of the state space by $|X|$, and assume without loss of generality that all actions $a \in A$ are admissible for all states $x \in X$.

4.3  Algorithm  Description 

The  basic  algorithmic  structure  of  ERPS  is  given  in  Figure  4.1,  where  some  steps 
are  presented  only  at  a  conceptual  level.  We  will  provide  a  detailed  explanation  of  these 
steps  and  discuss  their  implementation  details  in  the  following  subsections,  where  each 
subsection  corresponds  to  a  particular  step  of  the  algorithm. 

4.3.1  Initialization 

The inputs to the ERPS algorithm are an action selection distribution $\mathcal{P}$, an exploitation probability $q_0 \in [0, 1]$, a population size $n > 1$, and a search range $r_i$ for each state $x^i \in X$. There is a lot of flexibility in the choice of the initial population of policies; we can even take all policies in the initial population to be exactly the same. This is because of the randomized search technique employed in ERPS (cf. Chapter 4.3.3), which makes the theoretical convergence results of our approach independent of this choice. However, to improve the performance of ERPS, we often want to maintain a certain diversity among the policies in the initial population; one simple method to achieve such diversity is to draw each individual policy from the policy space $\Pi$ according to a uniform distribution.

Evolutionary Random Policy Search (ERPS)

• Initialization: Specify an action selection distribution $\mathcal{P}$, a population size $n > 1$, and a parameter $q_0 \in [0, 1]$. For each state $x^i \in X$, $i = 1, \ldots, |X|$, specify a search range $r_i$. Select an initial population of policies $\Lambda_0 = \{\pi_1^0, \pi_2^0, \ldots, \pi_n^0\}$. Construct an initial sub-MDP as $\mathcal{G}_{\Lambda_0} := (X, \Gamma_0, P, R, \alpha)$, where $\Gamma_0 = \bigcup_x \Lambda_0(x)$. Set $\pi_*^{-1} := \pi_1^0$.

• Loop until the stopping rule is satisfied:

  Policy Improvement with Cost Swapping (PICS):

  * For each $\pi_j^k \in \Lambda_k$, compute the corresponding value function $J^{\pi_j^k}$.

  * Compute the elite policy

    $\pi_*^k(x) = \arg\min_{a \in \Lambda_k(x)} \Big\{ R(x, a) + \alpha \sum_y P_{x,y}(a) \big[ \min_{\pi_j^k \in \Lambda_k} J^{\pi_j^k}(y) \big] \Big\}, \quad \forall x \in X.$

  Construct a Sub-MDP:

  * for $j = 2$ to $n$
      for $i = 1$ to $|X|$
        generate a r.v. $u \sim U[0, 1]$,
        if $u \le q_0$ (exploitation)
          choose an action $a$ in the neighborhood of $\pi_*^k(x^i)$ by using
          the “nearest neighbor” heuristic. Set $\pi_j^{k+1}(x^i) = a$.
        elseif $u > q_0$ (exploration)
          choose an action $a$ according to $\mathcal{P}$, set $\pi_j^{k+1}(x^i) = a$.
        end if
      end for
    end for

  * Set the next population of policies as $\Lambda_{k+1} = \{\pi_*^k, \pi_2^{k+1}, \ldots, \pi_n^{k+1}\}$.

  * Obtain the next sub-MDP $\mathcal{G}_{\Lambda_{k+1}} := (X, \Gamma_{k+1}, P, R, \alpha)$, where $\Gamma_{k+1} = \bigcup_x \Lambda_{k+1}(x)$.

  * Set $k \leftarrow k + 1$.

Figure 4.1: Evolutionary Random Policy Search

The action selection distribution $\mathcal{P}$ is a prespecified probability distribution over the action space, and will be used to construct sub-MDPs (cf. Chapter 4.3.3). Note that $\mathcal{P}$ could be state dependent in general, i.e., we could prescribe for each state $x \in X$ a different action selection distribution according to some prior knowledge of the problem structure. Here, for ease of exposition, we ignore its explicit dependency on the state and prescribe the same $\mathcal{P}$ for all $x \in X$. Again, one simple choice of $\mathcal{P}$ is the uniform distribution. The exploitation probability $q_0$ and the search range $r_i$ will be used to construct sub-MDPs; the detailed discussion of these two parameters is deferred to Chapter 4.3.3.

4.3.2  Policy  Improvement  with  Cost  Swapping 

The idea behind ERPS is to randomly split a large MDP problem into a sequence of smaller, manageable MDPs, and to extract a possibly convergent sequence of policies by solving these smaller problems. For a given population of policies $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_n\}$, we consider the subsets of actions given by $\Lambda(x) := \{\pi_1(x), \pi_2(x), \ldots, \pi_n(x)\}$, $\forall x \in X$. We can then define a sub-MDP problem $\mathcal{G}_\Lambda := (X, \Gamma, P, R, \alpha)$ by restricting the original MDP (e.g., costs, transition probabilities) to these subsets of actions, where $\Gamma := \bigcup_x \Lambda(x)$. Note that for a given state $x$, $\Lambda(x)$ is in general a multi-set, which may contain the same action more than once; however, we can always discard the redundant members and view $\Lambda(x)$ as the set of admissible actions at state $x$. One can of course solve the sub-MDP $\mathcal{G}_\Lambda$ exactly by using the PI algorithm, thus obtaining a policy that improves all policies in the current population. However, it is well known that PI is a sequential computational approach and will in general take more than one iteration to find such a policy. So instead of solving $\mathcal{G}_\Lambda$ exactly, we propose an approach that solves it only approximately. The approach is particularly amenable to parallel computing. It manipulates the policies in a given population by combining the crossover idea in standard GAs with special MDP properties, and is able to obtain an improved policy in just one iteration. The approach consists of the following two steps and produces a policy, which we call an “elite” policy, that is superior to all of the policies in the current population.

Step 1: Obtain the value functions $J^{\pi_j}$, $j = 1, \ldots, n$, by solving the equations:

\[ J^{\pi_j}(x) = R(x, \pi_j(x)) + \alpha \sum_y P_{x,y}(\pi_j(x)) J^{\pi_j}(y), \quad \forall x \in X. \qquad (4.2) \]

Step 2: Compute the elite policy $\pi_*$ by

\[ \pi_*(x) = \arg\min_{a \in \Lambda(x)} \Big\{ R(x, a) + \alpha \sum_y P_{x,y}(a) \big[ \min_{\pi_j \in \Lambda} J^{\pi_j}(y) \big] \Big\}, \quad \forall x \in X. \qquad (4.3) \]

Since in (4.3) we are basically performing the policy improvement on the “swapped cost” $\min_{\pi_j \in \Lambda} J^{\pi_j}(x)$, we call this procedure “policy improvement with cost swapping” (PICS). PICS can be thought of as a population-based variant of the standard PI, where essentially we view each policy in a given population as genetic material, and the way we obtain the “swapped cost” corresponds to the gene crossover in standard GAs. Note that the “swapped cost” $\min_{\pi_j \in \Lambda} J^{\pi_j}(x)$ may not be the value function corresponding to any policy; intuitively, it may prevent us from choosing a poor starting policy in the policy improvement step. We now formalize this intuition in the following theorem.
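
The two PICS steps can be sketched for a small finite MDP as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: actions are integer indices, `R[x][a]` and `P[x][a][y]` are hypothetical nested lists holding the costs and transition probabilities, and Step 1 is approximated by fixed-point iteration only to avoid external dependencies, whereas the text solves (4.2) as a linear system.

```python
from typing import List, Sequence, Tuple

Policy = List[int]  # policy[x] is the index of the action prescribed at state x


def evaluate_policy(policy: Policy, R: Sequence[Sequence[float]],
                    P: Sequence[Sequence[Sequence[float]]], alpha: float,
                    tol: float = 1e-10) -> List[float]:
    """Step 1: approximate J^pi in (4.2) by fixed-point iteration."""
    n_states = len(R)
    J = [0.0] * n_states
    while True:
        J_new = [
            R[x][policy[x]]
            + alpha * sum(P[x][policy[x]][y] * J[y] for y in range(n_states))
            for x in range(n_states)
        ]
        if max(abs(a - b) for a, b in zip(J_new, J)) < tol:
            return J_new
        J = J_new


def pics(population: Sequence[Policy], R, P, alpha: float
         ) -> Tuple[Policy, List[float]]:
    """Step 2: policy improvement on the swapped cost, as in (4.3)."""
    n_states = len(R)
    values = [evaluate_policy(pi, R, P, alpha) for pi in population]
    # Swapped cost: pointwise minimum of the value functions in the population.
    swapped = [min(J[y] for J in values) for y in range(n_states)]
    elite: Policy = []
    for x in range(n_states):
        candidates = sorted({pi[x] for pi in population})  # Lambda(x), duplicates dropped

        def q_value(a: int, x: int = x) -> float:
            return R[x][a] + alpha * sum(
                P[x][a][y] * swapped[y] for y in range(n_states))

        elite.append(min(candidates, key=q_value))
    return elite, swapped


# Hypothetical two-state, two-action example.
R = [[1.0, 2.0], [0.0, 3.0]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
population = [[0, 1], [1, 0]]
elite, swapped = pics(population, R, P, alpha=0.9)
J_elite = evaluate_policy(elite, R, P, alpha=0.9)
```

In this toy instance the second policy dominates the first, yet the elite policy returned by `pics` mixes actions from both members of the population, which is the crossover effect described above.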


Theorem 4.3.1 Given $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_n\}$, let $\bar{J}(x) = \min_{\pi_j \in \Lambda} J^{\pi_j}(x)$, $\forall x \in X$, and let

\[ \mu(x) = \arg\min_{a \in \Lambda(x)} \Big\{ R(x, a) + \alpha \sum_y P_{x,y}(a) \bar{J}(y) \Big\}. \]

Then $J^{\mu}(x) \le \bar{J}(x)$, $\forall x \in X$. Furthermore, if $\mu$ is not optimal for $\mathcal{G}_\Lambda$, then $J^{\mu}(x) < \bar{J}(x)$ for at least one $x \in X$.

Proof: We define $J_0(x) = R(x, \mu(x)) + \alpha \sum_y P_{x,y}(\mu(x)) \bar{J}(y)$, and consider the sequence $\{J_i(x), i = 1, 2, \ldots\}$ generated by the recursion $J_{i+1}(x) = R(x, \mu(x)) + \alpha \sum_y P_{x,y}(\mu(x)) J_i(y)$, $\forall i = 0, 1, 2, \ldots$. At an arbitrary state $x$, by the definition of $\bar{J}(x)$, there exists $\pi_j$ such that $\bar{J}(x) = J^{\pi_j}(x)$. It follows that

\[ \begin{aligned}
J_0(x) &\le R(x, \pi_j(x)) + \alpha \sum_y P_{x,y}(\pi_j(x)) \bar{J}(y) \\
       &\le R(x, \pi_j(x)) + \alpha \sum_y P_{x,y}(\pi_j(x)) J^{\pi_j}(y) \\
       &= J^{\pi_j}(x) \\
       &= \bar{J}(x),
\end{aligned} \]

and since $x$ is arbitrary, we have

\[ \begin{aligned}
J_1(x) &= R(x, \mu(x)) + \alpha \sum_y P_{x,y}(\mu(x)) J_0(y) \\
       &\le R(x, \mu(x)) + \alpha \sum_y P_{x,y}(\mu(x)) \bar{J}(y) \\
       &= J_0(x).
\end{aligned} \]

By induction it is easy to see that $J_{i+1}(x) \le J_i(x)$, $\forall x \in X$ and $\forall i = 0, 1, 2, \ldots$. On the other hand, it is well known (cf., e.g., [13]) that the sequence $J_0(x), J_1(x), J_2(x), \ldots$ generated by the above recursion will converge to $J^{\mu}(x)$, $\forall x \in X$. Therefore we have $J^{\mu}(x) \le \bar{J}(x)$, $\forall x$. Note that if $J^{\mu}(x) = \bar{J}(x)$, $\forall x \in X$, then PICS reduces to the standard policy improvement on policy $\mu$, and it follows that $\mu$ satisfies Bellman's optimality equation and is thus optimal for $\mathcal{G}_\Lambda$. Hence we must have $J^{\mu}(x) < \bar{J}(x)$ for some $x \in X$ whenever $\mu$ is not optimal. ∎

Now at the $k$th iteration, given the current policy population $\Lambda_k$, we compute the $k$th elite policy $\pi_*^k$ via PICS. According to Theorem 4.3.1, the elite policy improves any policy in $\Lambda_k$, and since $\pi_*^k$ is directly used to generate the $(k+1)$th sub-MDP (cf. Figure 4.1 and Chapter 4.3.3), the following monotonicity property is immediately clear:

Corollary 4.3.2 For all $k \ge 0$,

\[ J^{\pi_*^{k+1}}(x) \le J^{\pi_*^k}(x), \quad \forall x \in X. \]

Proof: Follows by induction. ∎

PICS is similar to the so-called “policy switching” proposed in [19], where an “elite” policy is also obtained at each iteration of the method. However, unlike PICS, policy switching constructs an elite policy by directly manipulating each individual policy in the population. More specifically, for the given policy population $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_n\}$, the elite policy is constructed as

\[ \pi_*(x) \in \Big\{ \arg\min_{\pi_i \in \Lambda} \big( J^{\pi_i}(x) \big)(x) \Big\}, \quad \forall x \in X, \qquad (4.4) \]

where the value functions $J^{\pi_i}$, $\forall \pi_i \in \Lambda$, are also obtained by using the policy evaluation step, i.e., (4.2). Chang et al. [19] have shown that the elite policy $\pi_*$ generated by (4.4) also improves any policy in the population $\Lambda$. Note that the computational complexity of executing (4.4) is $O(n|X|)$, which is in general much lower than the computational cost required by a direct optimization over the entire solution space.
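
Policy switching as in (4.4) can be sketched in a few lines, assuming the value functions have already been computed by the policy evaluation step (4.2). The function name `policy_switching` and the numbers in the example are hypothetical; `values[j][x]` stands for the value of policy `j` at state `x`.

```python
from typing import List, Sequence


def policy_switching(population: Sequence[Sequence[int]],
                     values: Sequence[Sequence[float]]) -> List[int]:
    """At each state, copy the action of the policy with the smallest value there."""
    n_states = len(population[0])
    elite = []
    for x in range(n_states):
        # Index of the policy whose value function is smallest at state x.
        j_best = min(range(len(population)), key=lambda j: values[j][x])
        elite.append(population[j_best][x])
    return elite


# Hypothetical example: two policies on two states.
example_population = [[0, 1], [1, 0]]
example_values = [[3.0, 5.0], [4.0, 2.0]]
switched = policy_switching(example_population, example_values)
```

The loop makes one pass over the $n$ policies for each state, which matches the $O(n|X|)$ cost noted above; no optimization over actions is performed.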

In contrast to policy switching, PICS still retains an optimization mechanism (as in PI) over the restricted subsets of actions, which may introduce additional computational cost. However, we argue that PICS will in general substantially improve on the performance of policy switching at only a negligible extra computational expense. To illustrate this point, we now provide an intuitive comparison between these two approaches; some empirical evidence can also be found later in Chapter 4.6. For a given group of policies $\Lambda$, we let $\Pi_\Lambda$ be the policy space induced by the sub-MDP $\mathcal{G}_\Lambda$; it is easy to see that the size of $\Pi_\Lambda$ is on the order of $n^{|X|}$. As we see from (4.4), policy switching only takes into account each individual policy in $\Lambda$, while PICS tends to search the entire space $\Pi_\Lambda$ (by carrying out an optimization over $\mathcal{G}_\Lambda$), which is a much larger set than $\Lambda$. Although it is not clear in general that the elite policy generated by PICS improves the elite policy generated by policy switching, since the policy improvement step is quite fast (cf., e.g., [13]) and focuses on the best policy updating directions, we believe this will be the case in many situations. For example, consider the case where the population $\Lambda$ contains one particular policy, say $\bar{\pi}$, that dominates (in terms of value functions) all other policies in the population. It is obvious that policy switching will choose $\bar{\pi}$ as the elite policy; thus no further improvement can be achieved at the next iteration. In contrast, PICS considers the sub-MDP $\mathcal{G}_\Lambda$; as long as $\bar{\pi}$ is not optimal for $\mathcal{G}_\Lambda$ (cf. Theorem 4.3.1), a strictly improving policy can always be obtained in the next iteration.

The computational complexity of each iteration of PICS is approximately the same as that of policy switching, because step 1 of PICS, i.e., (4.2), which is also used by policy switching, requires the solution of $n$ systems of linear equations. The number of numerical operations required by a direct method (e.g., standard Gaussian elimination) is $O(n|X|^3)$, and this dominates the cost of step 2, which is at most $O(n|X|^2)$.
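
To make the comparison concrete, the following worked arithmetic uses hypothetical sizes $n = 10$ and $|X| = 100$ (these values are chosen only for illustration and constant factors are ignored):

```latex
\begin{align*}
\text{step 1 (policy evaluation):} \quad & n|X|^3 = 10 \cdot 100^3 = 10^7, \\
\text{step 2 (improvement on the swapped cost):} \quad & n|X|^2 = 10 \cdot 100^2 = 10^5, \\
\text{ratio:} \quad & \frac{n|X|^3}{n|X|^2} = |X| = 100 .
\end{align*}
```

Under these assumptions the evaluation step accounts for roughly two orders of magnitude more operations than the extra optimization in step 2, and the gap grows linearly with $|X|$.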

4.3.3  Sub-MDP  Generation 

The description of the “sub-MDP generation” step in Figure 4.1 is only at a conceptual level. To better explain this step, we now distinguish between two different settings. We start by considering the case where the action space is discrete; then we extend our discussion to the setting where the action space is continuous.

Discrete  Action  Spaces 

By Corollary 4.3.2, the performance of the elite policy at the current iteration is at least as good as that of the elite policies generated at previous iterations. However, depending on how new policies are generated and constructed at each iteration, strict improvement among elite policies cannot always be guaranteed. Our focus now is how to achieve consistent improvements among the elite policies found at consecutive iterations. Of course, one possibility is to use unbiased random sampling and choose at each iteration a sub-MDP problem by making use of the action selection distribution $\mathcal{P}$. By doing so, it is obvious that we may always obtain an improved elite policy after a sufficient number of iterations. Such an unbiased sampling scheme is very effective in escaping local optima and is often useful in finding a good candidate solution. However, in practice persistent improvements will be more and more difficult to achieve as the number of iterations (sampling instances) increases, since the probability of finding better elite policies typically becomes smaller and smaller. We refer the reader to [56] for a more insightful discussion in a global optimization context. Thus, it appears that a biased sampling scheme could be more helpful.

The biased sampling scheme can be achieved in many different ways; one possibility is via the use of the “nearest neighbor” heuristic, which is the focus of our approach. To achieve a biased sampling configuration, ERPS combines exploitation (“nearest neighbor” heuristic) with exploration (unbiased sampling). The key to balancing these two types of searches is the use of the exploitation probability $q_0$. For a given elite policy $\pi_*$, we construct a new policy, say $\hat{\pi}$, in the next population generation as follows: at each state $x \in X$, with probability $q_0$, $\hat{\pi}(x)$ is selected from a small neighborhood of $\pi_*(x)$, and with probability $1 - q_0$, $\hat{\pi}(x)$ is chosen according to the action selection distribution $\mathcal{P}$ (i.e., unbiased random sampling). The preceding steps are performed repeatedly until we have obtained $n - 1$ new policies, and the next population generation is simply formed by the elite policy $\pi_*$ and the $n - 1$ newly generated policies. Intuitively, the use of exploitation will introduce more robustness into the algorithm and help to locate the exact optimal policy, while the exploration step will help the algorithm to escape local optima and to find attractive policies quickly. In effect, we see that this idea is equivalent to altering the underlying action selection distribution, in that $\mathcal{P}$ is artificially made more peaked around the action $\pi_*(x)$.

To give a detailed implementation of the “nearest neighbor” heuristic, we should at least require that the action space $A$ is a non-empty metric space with a defined metric on it. Once a metric $d(\cdot, \cdot)$ is given, the “nearest neighbor” heuristic in Figure 4.1 could be naturally implemented as follows:

Let $r_i$, a positive integer, be the search range for state $x^i$, $i = 1, 2, \ldots, |X|$. We assume that $r_i < |A|$ for all $i$, where $|A|$ is the size of the action space.

• Generate a random variable $l$ according to the discrete uniform distribution between 1 and $r_i$, i.e., $l \sim DU(1, r_i)$. Choose an action $a = \hat{\pi}(x^i) \in A$ for the new policy $\hat{\pi}$ such that, under the given metric $d(\cdot, \cdot)$, $\hat{\pi}(x^i)$ is the $l$th closest action to the elite action $\pi_*(x^i)$.
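
The exploitation and exploration steps for a discrete, one-dimensional action space can be sketched as below. Several choices here are assumptions made for illustration and are not fixed by the text: the metric is $d(a, b) = |a - b|$, the elite action itself is not counted among its own neighbors, ties in distance are broken by the smaller action, the action selection distribution is uniform, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def nearest_neighbor_action(elite_action: int, actions: Sequence[int],
                            search_range: int, rng: random.Random) -> int:
    """Draw l ~ DU(1, r_i) and return the l-th closest action to the elite action."""
    l = rng.randint(1, search_range)  # randint includes both endpoints
    others = sorted((a for a in actions if a != elite_action),
                    key=lambda a: (abs(a - elite_action), a))
    return others[l - 1]  # requires search_range < len(actions)


def generate_policy(elite: Sequence[int], actions: Sequence[int], q0: float,
                    search_ranges: Sequence[int], rng: random.Random) -> List[int]:
    """Build one new policy from the elite policy, state by state."""
    new_policy = []
    for i, elite_action in enumerate(elite):
        if rng.random() <= q0:   # exploitation: stay near the elite action
            new_policy.append(nearest_neighbor_action(
                elite_action, actions, search_ranges[i], rng))
        else:                    # exploration: sample from a uniform distribution
            new_policy.append(rng.choice(actions))
    return new_policy


def next_population(elite: Sequence[int], actions: Sequence[int], n: int,
                    q0: float, search_ranges: Sequence[int],
                    rng: random.Random) -> List[List[int]]:
    """Elite policy plus n - 1 newly generated policies."""
    return [list(elite)] + [
        generate_policy(elite, actions, q0, search_ranges, rng)
        for _ in range(n - 1)
    ]


# Hypothetical example: 10 ordered actions, 3 states, search range 2 everywhere.
demo_rng = random.Random(0)
demo_actions = list(range(10))
demo_elite = [5, 0, 9]
demo_population = next_population(demo_elite, demo_actions, n=4, q0=0.5,
                                  search_ranges=[2, 2, 2], rng=demo_rng)
```

For action spaces that are not naturally ordered, the sort inside `nearest_neighbor_action` would be replaced by a spatial index; the sketch sorts on every call only to stay short.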

Remark 4.3.1 Although the above procedure is conceptually simple, it is sometimes not easy
to implement. It is often necessary to index a (possibly high-dimensional) metric space,
whose complexity will depend on the dimension of the problem and the cost of evaluating
the distance function $d(\cdot,\cdot)$. However, we note that the action spaces of many MDP
problems in practice are subsets of $\mathbb{R}^N$, where many efficient methods can be applied,
such as Kd-trees ([12]) and R-trees ([38]). The most favorable situation is an action
space that is “naturally ordered”, e.g., in inventory control problems where actions are the
number of items to be ordered, $A = \{0, 1, 2, \ldots\}$, in which case the indexing and ordering
become trivial.
In EPI, policies in a new generation are generated by the so-called “policy mutation”
procedure, which is carried out by altering a given policy in the following manner: for each
state $x$, the currently prescribed action is replaced probabilistically. The main reason for
mutating policies is to avoid being caught in a local maximum, making a probabilistic
convergence guarantee possible. Two types of mutations are considered: “global mutation”
and “local mutation”, which are distinguished by the degree of mutation, as indicated
by the number of states with changed actions in the mutated policy. The algorithm first
decides whether to mutate a given policy $\pi$ “globally” or “locally” according to a mutation
probability $P_m$. Then at each state $x$, $\pi(x)$ is mutated with probability $P_g$ ($P_l$), where
$P_g$ and $P_l$ are the respective predefined global mutation and local mutation probabilities.
It is assumed that $P_g$ is generally close to one and $P_l$ close to zero, thus $P_g \gg P_l$; the
idea is that “global mutation” helps the algorithm to get out of local optima and “local
mutation” helps the algorithm to fine-tune the solution. If a mutation is to occur, the
action is changed by using the action selection probability $\mathcal{P}$. As a result, we see that
each action in a new policy generated by “policy mutation” either remains unchanged or
is altered by pure random sampling; although the so-called “local mutation” is used, no
local search element is actually involved in the process. Thus, as we can see, the algorithm
only operates at the global level, which is essentially equivalent to setting the exploitation
probability $q_0 = 0$ in our approach.
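To make the comparison concrete, the mutation step described above might look as follows. This is a hedged sketch rather than EPI's actual implementation: the name `mutate_policy` and the callback `sample_action` (standing in for a draw from $\mathcal{P}$) are hypothetical, and the text does not say which branch $P_m$ selects, so the sketch assumes a global mutation is chosen with probability $P_m$.

```python
import random


def mutate_policy(policy, sample_action, p_m, p_g, p_l, rng=random):
    """Return a mutated copy of `policy` (a list with one action per state).

    Assumption: with probability p_m the mutation is "global" (per-state
    probability p_g); otherwise it is "local" (per-state probability p_l).
    """
    per_state_prob = p_g if rng.random() < p_m else p_l
    mutated = []
    for action in policy:
        if rng.random() < per_state_prob:
            # A mutated action is a fresh draw from the action selection
            # distribution; no neighborhood of the old action is used.
            mutated.append(sample_action(rng))
        else:
            mutated.append(action)
    return mutated
```

Both branches resample from the same distribution and differ only in how many states are touched, which is the point made in the paragraph above.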

Continuous  Action  Spaces 

We now carry the biased sampling idea one step further by considering MDPs with
continuous action spaces. We let $\mathcal{B}_A$ be the smallest $\sigma$-algebra containing all the open
sets in $A$, and let the action selection distribution $\mathcal{P}$ be a probability measure defined on
$(A, \mathcal{B}_A)$. Again, we assume that there is a metric $d(\cdot,\cdot)$ defined on $A$. Thus, a high-level
implementation of the exploitation step in Figure 4.1 can be described as follows:

Let $r_i > 0$ denote the search range for state $x_i$, $i = 1, 2, \ldots, |X|$.

• Choose an action uniformly (according to a uniform distribution) from the set of
neighbors $\{a : d(a, \pi^k_*(x_i)) \le r_i,\ a \in A\}$.

Note that in the two action space settings we have discussed, i.e., the discrete
case and the continuous case, the search range parameter $r_i$ has different meanings. In
the former case, $r_i$ is a positive integer indicating the number of candidate actions that are
the closest to the current elite action $\pi^k_*(x_i)$, whereas in the latter case, $r_i$ is the distance
from the current elite action, which may take any positive real value.

If we further impose some additional structure on $A$ and assume that $A$ is a non-empty
open connected subset of $\mathbb{R}^N$ with some metric (e.g., the infinity-norm), then a
detailed implementation of the above exploitation step is as follows.

• Generate a random vector $\lambda^i = (\lambda^i_1, \ldots, \lambda^i_N)^T$ with each $\lambda^i_h \sim U[-1, 1]$ independent
for all $h = 1, 2, \ldots, N$, and choose the action $\pi^{k+1}_j(x_i) = \pi^k_*(x_i) + \lambda^i r_i$.

• If $\pi^{k+1}_j(x_i) \notin A$, then repeat the above step.
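The two bullets above amount to rejection sampling inside an infinity-norm box around the elite action. A minimal Python sketch follows; the names `exploit_continuous` and `in_action_space` are hypothetical, and the cap `max_tries` is an added safeguard that is not part of the procedure in the text.

```python
import random


def exploit_continuous(elite_action, r, in_action_space, rng=random,
                       max_tries=10_000):
    """Sample an action uniformly from the box of half-width r around
    elite_action, redrawing until the candidate lies in the action space.

    elite_action    : sequence of N floats, the elite action at one state
    in_action_space : predicate returning True if a candidate is in A
    max_tries       : assumption added here to guarantee termination
    """
    for _ in range(max_tries):
        # Each coordinate gets an independent perturbation lambda_h * r,
        # with lambda_h ~ U[-1, 1].
        candidate = [a + rng.uniform(-1.0, 1.0) * r for a in elite_action]
        if in_action_space(candidate):
            return candidate
    raise RuntimeError("no feasible action found within max_tries draws")
```

Because every coordinate uses the same `r`, the sampled region is a cube; passing a per-coordinate range instead would give the generalization mentioned in Remark 4.3.2.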

Remark 4.3.2 We remark that in the above implementation, the same $r_i$ value is used
along all directions of the action space. However, in practice, it is often useful to generalize
$r_i$ to an $N$-dimensional vector with each component controlling the search range in a
particular direction of the action space.

Remark 4.3.3 The metric $d(\cdot,\cdot)$ used in the “nearest neighbor” heuristic implicitly imposes
a structure on the action space. The efficiency of the algorithm, to a large extent,
depends on how the metric is actually defined. Like most random search methods
for global optimization, our approach is designed to exploit the structure that good
policies tend to be clustered together. Thus, in our context, a good metric should have
good potential for representing this structure. For example, the discrete metric (i.e.,
$d(a, a) = 0$ $\forall\, a \in A$ and $d(a, b) = 1$ $\forall\, a, b \in A$, $a \ne b$) should never be a good choice, since
it does not provide us with any useful information about the action space. For a given
action space, a good metric always exists but may not be known a priori. In the special
case where the action space is a subset of $\mathbb{R}^N$, we take the Euclidean metric as the default
metric; this is in accord with most of the optimization techniques employed in $\mathbb{R}^N$.

4.3.4  Stopping  Rule 

There is a lot of flexibility in the choice of stopping rule. One simple choice is to
stop the algorithm when a specified maximum number of iterations is reached. In
the numerical experiments in Section 4.6, we use one of the most common stopping
rules in standard GAs (cf. e.g., [19], [79], [86]): we stop the algorithm whenever $\exists\, k > 0$
such that $\|J^{\pi^{k+m}_*} - J^{\pi^k_*}\| = 0$ $\forall\, m = 1, 2, \ldots, K$, i.e., when no further improvement in
the elite policy (in terms of value function) is obtained for $K$ consecutive iterations.
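As an illustration, the rule can be checked from a running list of elite value functions. The sketch below assumes each value function is stored as a list of floats (one entry per state) and uses the infinity-norm; the name `should_stop` is hypothetical. It tests exact equality, as in the rule above, although a floating-point implementation might prefer a small tolerance.

```python
def should_stop(elite_values, K):
    """Return True if the last K elite value functions all equal the one
    that preceded them, i.e. no improvement for K consecutive iterations.

    elite_values : list of value functions, oldest first; each is a list
                   of floats indexed by state (an assumed representation)
    """
    if len(elite_values) < K + 1:
        return False
    reference = elite_values[-(K + 1)]
    for J in elite_values[-K:]:
        # Infinity-norm of the difference must be exactly zero.
        if max(abs(a - b) for a, b in zip(reference, J)) != 0:
            return False
    return True
```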

4.4  Convergence  of  ERPS 

In this section, we study the convergence properties of ERPS; in particular, we
show that the sequence of elite policies generated by ERPS will converge asymptotically
to an optimal policy with probability one. We start by defining some necessary notation.

For a given metric $d(\cdot,\cdot)$ on the action space $A$, we define the distance
between two policies $\pi^1$ and $\pi^2$ as

$d_\infty(\pi^1, \pi^2) := \max_{1 \le i \le |X|} d(\pi^1(x_i), \pi^2(x_i)).$

We can now further define the $\sigma$-neighborhood ($\sigma > 0$) of a given policy $\pi \in \Pi$ by

$\mathcal{N}(\pi, \sigma) := \{\bar{\pi} \mid d_\infty(\bar{\pi}, \pi) \le \sigma,\ \bar{\pi} \in \Pi\}.$

For each policy $\pi \in \Pi$, we also define $P_\pi$ as the transition matrix under policy $\pi$ whose
$(x, y)$th entry is $P_{x,y}(\pi(x))$, and define $R_\pi$ as the one-stage cost vector whose $x$th entry
is $R(x, \pi(x))$. Throughout the analysis, we denote by $\|\cdot\|_\infty$ the infinity-norm over $\mathbb{R}^{|X|}$,
given by $\|J\|_\infty := \max_{x \in X} |J(x)|$.
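The objects just defined are easy to compute for a finite state space. The following Python sketch is illustrative only: states are indexed $0, \ldots, |X| - 1$, a policy is a list of actions, and `trans_prob(x, y, a)` and `cost(x, a)` are hypothetical callbacks for $P_{x,y}(a)$ and $R(x, a)$. The value function is obtained by simple fixed-point iteration on $J = R_\pi + \alpha P_\pi J$, used here in place of a direct linear solve to keep the example dependency-free.

```python
def policy_distance(pi1, pi2, d=lambda a, b: abs(a - b)):
    """d_inf(pi1, pi2): largest per-state action distance."""
    return max(d(a, b) for a, b in zip(pi1, pi2))


def in_neighborhood(pi_bar, pi, sigma, d=lambda a, b: abs(a - b)):
    """True if pi_bar lies in the sigma-neighborhood of pi."""
    return policy_distance(pi_bar, pi, d) <= sigma


def policy_matrices(policy, trans_prob, cost):
    """Build P_pi (list of rows) and R_pi (list) for a given policy."""
    n = len(policy)
    P = [[trans_prob(x, y, policy[x]) for y in range(n)] for x in range(n)]
    R = [cost(x, policy[x]) for x in range(n)]
    return P, R


def evaluate_policy(P, R, alpha, iters=2000):
    """Approximate J^pi by iterating J <- R_pi + alpha * P_pi J."""
    n = len(R)
    J = [0.0] * n
    for _ in range(iters):
        J = [R[x] + alpha * sum(P[x][y] * J[y] for y in range(n))
             for x in range(n)]
    return J


def inf_norm(J):
    """Infinity-norm of a value function stored as a list."""
    return max(abs(v) for v in J)
```

Since $0 < \alpha < 1$ and $P_\pi$ is a stochastic matrix, the iteration is a contraction in the infinity-norm, so the number of iterations only controls the accuracy of the approximation.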

ERPS is a randomized approach: each run of the algorithm gives a particular realization
of the sequence of elite policies (i.e., a sample path); thus the algorithm induces a
probability distribution over the set of all such sequences of elite policies. We denote the
probability measure and expectation with respect to this distribution by $\mathcal{P}(\cdot)$ and $E(\cdot)$,
respectively.

The  convergence  of  ERPS  is  stated  in  the  next  theorem. 

Theorem 4.4.1 Let $\pi^*$ be an optimal policy with corresponding value function $J^{\pi^*}$, and
let the sequence of elite policies generated by ERPS together with their corresponding value
functions be denoted by $\{\pi^k_*, k = 1, 2, \ldots\}$ and $\{J^{\pi^k_*}, k = 1, 2, \ldots\}$, respectively. Assume
that:

1. $q_0 < 1$.

2. For any given $\varepsilon > 0$, $\mathcal{P}(\{a \mid d(a, \pi^*(x)) \le \varepsilon,\ a \in A\}) > 0$, $\forall\, x \in X$ (recall that $\mathcal{P}(\cdot)$
is a probability measure on the action space $A$).

3. There exist constants $\sigma > 0$, $\phi > 0$, $L_1 < \infty$, and $L_2 < \infty$, such that for all
$\pi \in \mathcal{N}(\pi^*, \sigma)$ we have $\|P_\pi - P_{\pi^*}\|_\infty \le \min\{L_1 d_\infty(\pi, \pi^*),\ \frac{1-\alpha}{\alpha} - \phi\}$ ($0 < \alpha < 1$), and
$\|R_\pi - R_{\pi^*}\|_\infty \le L_2 d_\infty(\pi, \pi^*)$.

Then for any given $\varepsilon > 0$, there exists a random variable $\mathcal{M}_\varepsilon > 0$ such that
$\mathcal{P}(\mathcal{M}_\varepsilon < \infty) = 1$ and $E(\mathcal{M}_\varepsilon) < \infty$, and $\|J^{\pi^k_*} - J^{\pi^*}\|_\infty \le \varepsilon$ $\forall\, k \ge \mathcal{M}_\varepsilon$.

Assumption 1 rules out pure local search by keeping the exploitation probability strictly
below one. Assumption 2 simply requires that any “ball” that contains the optimal policy
will have a strictly positive probability measure. It is trivially satisfied if the set
$\{a \mid d(a, \pi^*(x)) \le \varepsilon,\ a \in A\}$ has a positive (Borel) measure $\forall\, x \in X$ and the action
selection distribution $\mathcal{P}$ has infinite tails (e.g., Gaussian, exponential). Assumption 3
imposes some Lipschitz type of conditions on $P_\pi$ and $R_\pi$; it formalizes the notion that good
(near-optimal) policies are clustered together, i.e., the optimal policy is not isolated (cf.
Remark 4.3.3). The assumption can be straightforwardly verified if $P_\pi$ and $R_\pi$ are explicit
functions of $\pi$, which is the case for our numerical examples in Section 4.6. For a given
$\varepsilon > 0$, a policy $\pi$ satisfying $\|J^\pi - J^{\pi^*}\|_\infty \le \varepsilon$ is often referred to as an $\varepsilon$-optimal policy
(cf. [13], [63]).

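Remark 4.4.1 The result in Theorem 4.4.1 implies the a.s. convergence of the sequence
$\{J^{\pi^k_*}, k = 0, 1, \ldots\}$ to the optimal value function $J^{\pi^*}$. To see this, note that Theorem 4.4.1
implies that $\mathcal{P}(\|J^{\pi^k_*} - J^{\pi^*}\|_\infty > \varepsilon) \to 0$ as $k \to \infty$ for every given $\varepsilon$, which means that
the sequence converges in probability. Furthermore, since $\|J^{\pi^k_*} - J^{\pi^*}\|_\infty \le \varepsilon$ $\forall\, k \ge \mathcal{M}_\varepsilon$
is equivalent to $\sup_{\bar{k} \ge k} \|J^{\pi^{\bar{k}}_*} - J^{\pi^*}\|_\infty \le \varepsilon$ $\forall\, k \ge \mathcal{M}_\varepsilon$, we will also have
$\mathcal{P}(\sup_{\bar{k} \ge k} \|J^{\pi^{\bar{k}}_*} - J^{\pi^*}\|_\infty > \varepsilon) \to 0$ as $k \to \infty$, and the a.s. convergence thus follows.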

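Proof: We first derive an upper bound for $\|J^\pi - J^{\pi^*}\|_\infty$ in terms of the distance
$d_\infty(\pi, \pi^*)$. For policy $\pi^*$ and policy $\pi$ we have:

$J^{\pi^*} = R_{\pi^*} + \alpha P_{\pi^*} J^{\pi^*},$ (4.5)

$J^{\pi} = R_{\pi} + \alpha P_{\pi} J^{\pi}.$ (4.6)

Now define $\Delta J^{\pi^*} = J^\pi - J^{\pi^*}$, $\Delta P_{\pi^*} = P_\pi - P_{\pi^*}$ and $\Delta R_{\pi^*} = R_\pi - R_{\pi^*}$, and subtract
the above two equations. We have

$\Delta J^{\pi^*} = [I - (I - \alpha P_{\pi^*})^{-1} \alpha \Delta P_{\pi^*}]^{-1} (I - \alpha P_{\pi^*})^{-1} (\alpha \Delta P_{\pi^*} J^{\pi^*} + \Delta R_{\pi^*}).$ (4.7)

Taking the infinity-norm on both sides of (4.7) and using the consistency property of the
operator norm (i.e., $\|AB\| \le \|A\| \cdot \|B\|$), it follows that

$\|\Delta J^{\pi^*}\|_\infty \le \|[I - (I - \alpha P_{\pi^*})^{-1} \alpha \Delta P_{\pi^*}]^{-1}\|_\infty \, \|(I - \alpha P_{\pi^*})^{-1}\|_\infty \, (\alpha \|\Delta P_{\pi^*}\|_\infty \|J^{\pi^*}\|_\infty + \|\Delta R_{\pi^*}\|_\infty).$ (4.8)

Note that assumption 3 implies $\|\Delta P_{\pi^*}\|_\infty < \frac{1-\alpha}{\alpha}$. Thus

$\|(I - \alpha P_{\pi^*})^{-1} \alpha \Delta P_{\pi^*}\|_\infty \le \|(I - \alpha P_{\pi^*})^{-1}\|_\infty \, \alpha \|\Delta P_{\pi^*}\|_\infty$

$< \|(I - \alpha P_{\pi^*})^{-1}\|_\infty (1 - \alpha)$

$\le 1.$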


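To proceed, we now distinguish between two cases, $\|J^{\pi^*}\|_\infty = 0$ and $\|J^{\pi^*}\|_\infty \ne 0$.

Case 1. If $R_{\pi^*} = 0$ (i.e., $R(x, \pi^*(x)) = 0$ for all $x \in X$), then we have $J^{\pi^*} = 0$.
Thus $\Delta J^{\pi^*} = J^\pi$ and $\Delta R_{\pi^*} = R_\pi$. By noting $\|P_\pi\|_\infty = 1$, it follows from (4.6) that

$\|\Delta J^{\pi^*}\|_\infty = \|J^\pi\|_\infty \le \frac{1}{1 - \alpha \|P_\pi\|_\infty} \|R_\pi\|_\infty = \frac{1}{1 - \alpha} \|\Delta R_{\pi^*}\|_\infty.$

Then by assumption 3,

$\|\Delta J^{\pi^*}\|_\infty \le \frac{L_2}{1 - \alpha} d_\infty(\pi, \pi^*).$ (4.9)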

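Case 2. If $R_{\pi^*} > 0$ (i.e., $R(x, \pi^*(x)) > 0$ for some $x \in X$), then from (4.5), $\|J^{\pi^*}\|_\infty > 0$.
Divide both sides of (4.8) by $\|J^{\pi^*}\|_\infty$, use the relation that $\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}$ whenever
$\|B\| < 1$ and the consistency property; it immediately follows that

$\frac{\|\Delta J^{\pi^*}\|_\infty}{\|J^{\pi^*}\|_\infty} \le \frac{\|(I - \alpha P_{\pi^*})^{-1}\|_\infty}{1 - \|(I - \alpha P_{\pi^*})^{-1}\|_\infty \, \alpha \|\Delta P_{\pi^*}\|_\infty} \left( \alpha \|\Delta P_{\pi^*}\|_\infty + \frac{\|\Delta R_{\pi^*}\|_\infty}{\|J^{\pi^*}\|_\infty} \right)$

$= \frac{\mathcal{K}}{1 - \mathcal{K} \frac{\alpha \|\Delta P_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty}} \left( \frac{\alpha \|\Delta P_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty} + \frac{\|\Delta R_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty \|J^{\pi^*}\|_\infty} \right)$

$\le \frac{\mathcal{K}}{1 - \mathcal{K} \frac{\alpha \|\Delta P_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty}} \left( \frac{\alpha \|\Delta P_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty} + \frac{\|\Delta R_{\pi^*}\|_\infty}{\|R_{\pi^*}\|_\infty} \right)$

$\le \frac{\mathcal{K}}{1 - \mathcal{K} \frac{\alpha \|\Delta P_{\pi^*}\|_\infty}{\|I - \alpha P_{\pi^*}\|_\infty}} \left( \frac{\alpha L_1}{\|I - \alpha P_{\pi^*}\|_\infty} + \frac{L_2}{\|R_{\pi^*}\|_\infty} \right) d_\infty(\pi, \pi^*),$ (4.10)

where $\mathcal{K} = \|(I - \alpha P_{\pi^*})^{-1}\|_\infty \|I - \alpha P_{\pi^*}\|_\infty$. The third line uses
$\|R_{\pi^*}\|_\infty = \|(I - \alpha P_{\pi^*}) J^{\pi^*}\|_\infty \le \|I - \alpha P_{\pi^*}\|_\infty \|J^{\pi^*}\|_\infty$, which follows from (4.5), and the
last line uses assumption 3.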

In either case (see (4.9), (4.10)), we conclude that for any given $\varepsilon > 0$, there exists
a $\theta > 0$ such that for any $\pi \in \mathcal{N}(\pi^*, \sigma)$ where

$d_\infty(\pi, \pi^*) := \max_{1 \le i \le |X|} d(\pi(x_i), \pi^*(x_i)) \le \theta,$

we have $\|J^\pi - J^{\pi^*}\|_\infty = \|\Delta J^{\pi^*}\|_\infty \le \varepsilon$. Note that $\max_{1 \le i \le |X|} d(\pi(x_i), \pi^*(x_i)) \le \theta$ is
equivalent to

$d(\pi(x_i), \pi^*(x_i)) \le \theta, \quad \forall\, i = 1, 2, \ldots, |X|.$ (4.11)

By assumption 2, the set of actions that satisfies (4.11) will have a strictly positive
probability measure, and since $q_0 < 1$, it follows that the probability that a population generation
does not contain a policy in the neighborhood $\mathcal{N}(\pi^*, \min\{\theta, \sigma\})$ of the optimal policy is
strictly less than 1. Let $\psi$ be the probability that a randomly constructed policy is in
$\mathcal{N}(\pi^*, \min\{\theta, \sigma\})$. Then by Theorem 4.3.1, at each iteration the probability that an elite
policy is obtained in $\mathcal{N}(\pi^*, \min\{\theta, \sigma\})$ is at least $1 - (1 - \psi)^{n-1}$, where $n$ is the population
size. Let $\mathcal{M}_\varepsilon$ denote the number of iterations required to generate such an elite policy
for the first time. By the monotonicity of the sequence $\{J^{\pi^k_*}, k = 0, 1, \ldots\}$ (cf. Corollary
4.3.2), it is clear that $\|J^{\pi^k_*} - J^{\pi^*}\|_\infty \le \varepsilon$ $\forall\, k \ge \mathcal{M}_\varepsilon$. Now consider a random variable
$\overline{\mathcal{M}}$ that is geometrically distributed with a success probability of $1 - (1 - \psi)^{n-1}$. It is
not difficult to see that $\overline{\mathcal{M}}$ dominates $\mathcal{M}_\varepsilon$ stochastically (i.e., $\overline{\mathcal{M}} \ge_{st} \mathcal{M}_\varepsilon$), and because
$\psi > 0$, it follows that $E(\mathcal{M}_\varepsilon) \le E(\overline{\mathcal{M}}) = \frac{1}{1 - (1 - \psi)^{n-1}} < \infty$. ∎
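To get a feel for the final bound, the mean of the dominating geometric variable can be evaluated directly. The helper name `expected_iterations_bound` below is hypothetical; the formula is the one for $E(\overline{\mathcal{M}})$ in the last line of the proof.

```python
def expected_iterations_bound(psi, n):
    """Upper bound 1 / (1 - (1 - psi)**(n - 1)) on the expected number of
    iterations until an elite policy lands in the target neighborhood.

    psi : probability that one randomly constructed policy lands there
    n   : population size (n - 1 new policies are generated per iteration)
    """
    if not 0.0 < psi <= 1.0 or n < 2:
        raise ValueError("need 0 < psi <= 1 and n >= 2")
    success = 1.0 - (1.0 - psi) ** (n - 1)
    return 1.0 / success
```

For instance, with $\psi = 0.5$ and $n = 2$ the bound is 2 iterations. For small $\psi$ it is roughly $1 / ((n - 1)\psi)$, which illustrates why the bound can be very large in practice.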

Remark 4.4.2 In the above proof, we have used the infinity-norm. Since in finite-dimensional
spaces all norms are equivalent (cf. [28]), similar results can also be easily
established by using different norms, e.g., the Euclidean norm.

Remark 4.4.3 It should be noted that the result presented in Theorem 4.4.1 is rather
theoretical, because nothing can be said about the convergence rate of the algorithm or
how much improvement can be achieved at each iteration. As a consequence, the
random variable $\mathcal{M}_\varepsilon$ could be extremely large in practice.

Note that for a finite action space, assumption 3 in Theorem 4.4.1 is automatically
satisfied, and assumption 2 also holds trivially if we take $\mathcal{P}(a) > 0$ for all actions $a \in A$.
Furthermore, when the action space is finite, there always exists an $\varepsilon > 0$ such that
the only $\varepsilon$-optimal policy is the optimal policy itself. We have the following stronger
convergence result for ERPS when the action space is finite.

Corollary 4.4.2 (Finite action space) If the action space is finite, $q_0 < 1$, and the action
selection distribution $\mathcal{P}(a) > 0$ $\forall\, a \in A$, then there exists a random variable $\mathcal{M} > 0$ such
that $\mathcal{P}(\mathcal{M} < \infty) = 1$ and $E(\mathcal{M}) < \infty$, and $J^{\pi^k_*} = J^{\pi^*}$ $\forall\, k \ge \mathcal{M}$.

4.5  Adaptive  ERPS 

The search range parameter $r_i$ in ERPS is fixed throughout the algorithm. Intuitively,
small search ranges concentrate the search in small regions around the desirable
points and are helpful in refining promising solutions, but they often lead to small
improvements in the cost function, thus slowing down the convergence process. On the other
hand, large search ranges typically reduce the number of search steps needed to find a
good or near-optimal solution, but can be less effective in developing finer details around
desirable points and may result in less accurate solutions. In this section, we present
a modification of the ERPS method in which the value of the search range parameter
may change from one iteration to another. The idea is to adaptively shrink and expand
the search range so that we can speed up the convergence process without sacrificing
solution quality. A detailed description of the adaptive ERPS is given in Figure 4.2, where
we only consider the continuous action space case; the discrete action space version can
be constructed similarly.

Adaptive ERPS

• Initialization: Specify an initial search range $r$, parameters $K$, $1 \le K_1 < K$, $K_2 \ge 1$, $K_3 \ge 1$,
$\gamma > 1$ and a tolerance level $\varepsilon > 0$, where $K$ is the stopping control parameter as in ERPS. Set
$i \leftarrow 0$, $j \leftarrow 0$, and $h \leftarrow 0$.

• while ($i < K$ & $h < K_3$)

    Execute ERPS with search range $r$.

    Search range update:

    if $0 < \|J^{\pi^{k+1}_*} - J^{\pi^k_*}\| < \varepsilon$, then set $i \leftarrow 0$, $j \leftarrow j + 1$;
    elseif $\|J^{\pi^{k+1}_*} - J^{\pi^k_*}\| = 0$, then set $i \leftarrow i + 1$, $j \leftarrow 0$;
    else set $i \leftarrow 0$, $j \leftarrow 0$.
    end if

    if $i \ge K_1$, then set $r_{old} \leftarrow r$, $r \leftarrow r / \gamma$ end if
    if $j \ge K_2$, then set $r \leftarrow r \cdot \gamma$ end if
    if $r = r_{old}$, then set $h \leftarrow h + 1$; else set $h \leftarrow 0$. end if

  end while

Figure 4.2: Adaptive ERPS


We start by running ERPS with an initially specified search range $r$ (for simplicity,
we assume that the same search range is prescribed for all states), and monitor the
performance of the elite policy obtained at each iteration. If no improvements among the
elite policies are achieved for several, say $K_1$, consecutive iterations, then this indicates that
the current search range may be too large, and we decrease it by a factor $\gamma > 1$. On the
other hand, if for some number of consecutive iterations, say $K_2$, the improvements are
non-zero but smaller than some given tolerance $\varepsilon$, then it is likely that the current search
range is too small, and we increase it by the factor $\gamma$ until the improvement is greater than
the specified tolerance level. The search range is updated repeatedly until it has been
alternating between two values for $K_3$ times. Intuitively, the adaptive ERPS ensures that each
improvement in the elite policy is (approximately) at least $\varepsilon$; when no further improvement
is available either by increasing or by decreasing the search range, the value function
obtained will be within distance $\varepsilon$ of the optimal cost, i.e., the resulting elite policy is
approximately $\varepsilon$-optimal.
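The search range update can be sketched in isolation. In the Python sketch below, the ERPS iteration itself is replaced by a pre-supplied sequence of elite-policy improvements $\|J^{\pi^{k+1}_*} - J^{\pi^k_*}\|$, which is an assumption made only so the update logic can be run on its own. The function name `adaptive_search_range` is hypothetical, and the comparisons $i \ge K_1$, $j \ge K_2$ and the shrink step $r / \gamma$ follow the description above rather than a verified listing.

```python
def adaptive_search_range(improvements, r, K, K1, K2, K3, gamma, eps):
    """Replay the shrink/expand rule on a sequence of improvements.

    improvements : per-iteration improvements of the elite value function
                   (stand-in for actually executing ERPS; an assumption)
    Returns the final search range and the history of ranges used.
    """
    i = j = h = 0
    r_old = None
    history = [r]
    for delta in improvements:
        if not (i < K and h < K3):
            break
        if 0 < delta < eps:        # small but non-zero improvement
            i, j = 0, j + 1
        elif delta == 0:           # no improvement at all
            i, j = i + 1, 0
        else:                      # improvement of at least eps
            i, j = 0, 0
        if i >= K1:                # stalled: range may be too large, shrink
            r_old, r = r, r / gamma
        if j >= K2:                # creeping: range may be too small, expand
            r = r * gamma
        # Count how often the range returns to its pre-shrink value.
        h = h + 1 if r == r_old else 0
        history.append(r)
    return r, history
```

With `gamma = 2` the shrink and expand steps are exact in floating point; for other factors the equality test `r == r_old` may need a tolerance.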

Note that the validity of the $\varepsilon$-optimality claim relies on the assumption that if
there is an improvement of at least $\varepsilon$ available, then the algorithm will be able to find it
via adaptive adjustment of the search range. The above approach retains the theoretical
convergence properties of the original ERPS method and can be applied, at least in
principle, to many types of action spaces as long as a metric can be specified; however, we
must again emphasize that the efficiency of the approach will depend on the structure of
the problem to be solved and how the underlying metric is actually defined.

4.6  Numerical  Examples 

In this section, we investigate the empirical performance of ERPS by applying it to
two discrete-time controlled queueing examples and comparing its performance with those
of EPI ([19]) and standard PI. Throughout the experiments with ERPS, we use the same
search range parameter value for all states, denoted by a single variable $r$, and choose the
uniform distribution as the action selection distribution. All computational times are
in seconds.

4.6.1 A One-Dimensional Queueing Example

The following example has previously been studied in the approximate dynamic
programming literature (cf. e.g., [13], [27]). Consider a single-server queue with finite
capacity, where the server can serve only one customer in a period, and the service of a
customer begins/ends only at the beginning/end of any period. Assume that in any period
there is at most one customer arrival, and arrivals at the queue are independent
with probability $p = 0.2$ (i.e., no arrival with probability 0.8). The maximum queue
length is $L$, and an arrival that finds $L$ customers in the queue is lost. We denote by $x_t$
the state variable, i.e., the number of customers in the system at the beginning of period
$t$. The action (control) to be chosen at each state is the service completion probability
of the server, denoted by $a$, which takes values in a set $A$. In period $t$, if $a(x_t)$ is chosen,
then a service is completed with probability $a(x_t)$, a cost of $R(x_t, a(x_t))$ is incurred,
and the system makes a transition to state $x_{t+1}$. The goal is to choose the optimal service
completion probability for each state such that the total infinite-horizon discounted cost
$E[\sum_{t=0}^{\infty} \alpha^t R(x_t, a(x_t))]$ is minimized.
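One plausible reading of these dynamics is sketched below. The text does not spell out the order of events within a period, so the sketch makes its own assumptions: the arrival and the service completion in a period are independent, no service can complete when the system is empty, and an arrival is lost only if admitting it would push the state above `max_state`. The name `transition_probs` and the default `max_state=49` are illustrative choices.

```python
def transition_probs(x, a, p=0.2, max_state=49):
    """Return {next_state: probability} for state x under action a.

    x : number of customers in the system at the start of the period
    a : service completion probability chosen in state x
    p : arrival probability per period
    Modeling assumptions (not from the text): arrival and completion are
    independent, and blocking is applied to the end-of-period count.
    """
    probs = {}
    for arrival, p_arr in ((1, p), (0, 1.0 - p)):
        for done, p_done in ((1, a), (0, 1.0 - a)):
            served = done if x > 0 else 0   # nothing to serve when empty
            y = min(x - served + arrival, max_state)
            probs[y] = probs.get(y, 0.0) + p_arr * p_done
    return probs
```

Under these assumptions each row of $P_\pi$ has at most three non-zero entries (states $x - 1$, $x$ and $x + 1$), which is what makes the example one-dimensional.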

For this example, we consider two different choices of one-stage cost functions: (i)
a simple function that is convex in both state and action, where the one-stage cost at any
period for being in state $x$ and taking action $a$ is given by

$R(x, a) = x + 50a^2;$

(ii) a complex non-convex cost function

$R(x, a) = x + 5\left[\frac{|X|}{2} \sin(2\pi a) - x\right]^2,$

which induces a tradeoff in choosing between large values of $a$ to reduce the state $x$ and
appropriate values of $a$ to make the squared term small. Intuitively, the MDP problem
resulting from case (i) may have some nice properties (e.g., free of multiple local optimal
solutions), so finding an optimal solution should be a relatively easy task; whereas the cost
function in case (ii) introduces some further computational difficulties (e.g., multiple local
minima), intended to more fully test the effectiveness of a global algorithm like ERPS.
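For reference, the two cost functions can be written as small Python helpers. The names `cost_convex` and `cost_nonconvex` are hypothetical, and the second function assumes case (ii) reads $R(x, a) = x + 5[\frac{|X|}{2}\sin(2\pi a) - x]^2$, with `num_states` standing for $|X|$.

```python
import math


def cost_convex(x, a):
    """Case (i): R(x, a) = x + 50 a^2."""
    return x + 50.0 * a ** 2


def cost_nonconvex(x, a, num_states):
    """Case (ii), as assumed here: x + 5 * ((|X|/2) sin(2 pi a) - x)^2."""
    return x + 5.0 * (num_states / 2.0 * math.sin(2.0 * math.pi * a) - x) ** 2
```

In case (ii) the squared term vanishes whenever $\sin(2\pi a) = 2x / |X|$, which for most states has two solutions in $[0, 1]$. This is one way to see where multiple local minima can come from.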

For both cases, unless otherwise specified, the following parameter settings are used:
maximum queue length $L = 48$; state space $X = \{0, 1, 2, \ldots, 49\}$; discount factor $\alpha = 0.98$;
and in ERPS, population size $n = 10$, search range $r = 10$, and the standard Euclidean
distance is used to define the neighborhood. All computational results for ERPS are based
on 30 independent replications.


Discrete  Action  Space 

We first take the action space to be $A = \{10^{-4} k : k = 0, 1, \ldots, 10^4\}$, a discretized
version of the continuous interval $[0, 1]$. For this setting, we test the convergence of ERPS
by varying the values of the exploitation probability. Table 4.1 gives the performance of
the algorithm, where we define the relative error of a value function $J$ by

$\text{relerr} := \frac{\|J - J^*\|_\infty}{\|J^*\|_\infty},$ (4.12)

and $J^*$ is the optimal value function, which is obtained by using the standard PI. The
computational time required for PI to find the optimal value function $J^*$ was 15 seconds,
and the value of $\|J^*\|_\infty$ is approximately 2.32e+03. Test results clearly indicate superior
performance of ERPS over PI; in particular, when $q_0 = 0.25, 0.5, 0.75$, ERPS attains the
optimal solution in all 30 independent trials within 2 seconds.
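The relative error in (4.12) is straightforward to compute once both value functions are available. A minimal sketch, assuming value functions are stored as equal-length lists indexed by state (the function name `relerr` simply mirrors the notation above):

```python
def relerr(J, J_star):
    """Relative error ||J - J*||_inf / ||J*||_inf of a value function J."""
    num = max(abs(a - b) for a, b in zip(J, J_star))
    den = max(abs(b) for b in J_star)
    if den == 0:
        raise ValueError("relative error is undefined when J* is zero")
    return num / den
```

With $\|J^*\|_\infty \approx 2.32\text{e+}03$, a relative error of 1e-06 corresponds to an absolute error of about 2.3e-03 in the worst state.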


q0      stop rule (K)    Avg. time (std err)    mean relerr (std err)

0.0     2                0.84 (0.03)            7.63e-06 (8.50e-08)
0.0     4                1.41 (0.05)            2.78e-06 (3.29e-07)
0.0     8                2.67 (0.10)            7.83e-07 (1.06e-07)
0.0     16               5.12 (0.16)            1.81e-07 (1.88e-08)
0.0     32               8.91 (0.38)            6.19e-08 (1.07e-08)

0.25    2                0.94 (0.02)            3.32e-09 (1.42e-09)
0.25    4                1.08 (0.02)            9.65e-10 (2.59e-10)
0.25    8                1.24 (0.02)            3.02e-10 (9.51e-11)
0.25    16               1.52 (0.03)            4.54e-11 (3.86e-11)
0.25    32               1.85 (0.04)            0.00e-00 (0.00e-00)

0.50    2                0.92 (0.02)            2.14e-09 (1.29e-09)
0.50    4                1.00 (0.02)            2.53e-10 (1.10e-10)
0.50    8                1.11 (0.02)            7.61e-11 (5.02e-11)
0.50    16               1.27 (0.03)            0.00e-00 (0.00e-00)

0.75    2                1.14 (0.02)            4.14e-10 (2.84e-10)
0.75    4                1.19 (0.02)            2.40e-11 (1.67e-11)
0.75    8                1.27 (0.02)            1.18e-11 (1.18e-11)
0.75    16               1.44 (0.03)            0.00e-00 (0.00e-00)

1.0     2                12.14 (0.02)           1.66e-10 (5.18e-11)
1.0     4                12.19 (0.02)           4.85e-11 (3.49e-11)
1.0     8                12.28 (0.01)           0.00e-00 (0.00e-00)

Table  4.1:  Convergence  results  for  ERPS  (n  =  10,  r  =  10)  based  on  30  independent 
replications.  The  standard  errors  are  in  parentheses. 

To see how the computational complexity of ERPS changes with the size of the
action space, we test ERPS on several MDPs with increasing numbers of actions; for each
problem, the foregoing setting is used except that the action space now takes the form
$A_h = \{hk : k = 0, 1, \ldots, \frac{1}{h}\}$, where $h$ is the mesh size, selected sequentially (one for each
problem) from the set $\{\frac{1}{100}, \frac{1}{250}, \frac{1}{500}, \frac{1}{1000}, \frac{1}{2500}, \frac{1}{5000}, \frac{1}{10000}, \frac{1}{25000}, \frac{1}{50000}, \frac{1}{100000}, \frac{1}{200000}\}$;
the size of the action space is $|A_h| = \frac{1}{h} + 1$.

We  plot  in  Figure  4.3  the  running  time  required  for  PI  and  ERPS  to  find  the 
optimal  solutions  as  a  function  of  the  number  of  actions  of  each  MDP  considered,  where 
the  results  for  ERPS  are  the  averaged  time  over  30  independent  replications.  Empirical 
results  indicate  that  the  computational  time  for  PI  increases  linearly  in  the  number  of 
actions  (due  to  the  requirement  of  enumerating  the  action  space),  while  the  running  time 
required  for  ERPS  does  so  in  an  asymptotic  sense.  However,  ERPS  significantly  reduces 
the  computational  efforts  of  PI  by  roughly  a  factor  of  14  when  the  size  of  the  action 
space  is  large  (number  of  actions  greater  than  10^).  We  see  that  ERPS  also  delivers 
very  competitive  performances  even  when  the  action  space  is  small.  In  the  experiments, 
we  used  a  search  range  r  =  10  in  ERPS,  regardless  of  the  size  of  the  action  space;  we 
believe  the  performance  of  the  algorithm  could  be  enhanced  by  using  a  search  range  that 
is  proportional  to  the  size  of  the  action  space.  Moreover,  the  computational  effort  of  ERPS 
can  be  reduced  considerably  if  we  are  seeking  solutions  within  some  required  accuracy  of 
the  optimum  rather  than  searching  for  the  exact  optimal  solution. 

For case (ii), as expected, since the sine function is not monotone, the resulting
MDP problem has a very large number of local minima; some typical locally optimal
policies are shown in Figure 4.4.

We applied both EPI and ERPS to this case, where both algorithms start with the
same initial population. The convergence of EPI and ERPS is shown in Table 4.2. The
computational time required for PI to find the optimal value function $J^*$ was 14 seconds,
and the magnitude of $\|J^*\|_\infty$ is approximately 1.03e+05. For EPI, we have tested different
sets of parameters (recall from Section 4.3.3 that $P_m$ is the mutation probability, and
$P_g$ ($P_l$) are the predefined global (local) mutation probabilities); the results reported in
Table 4.2 are the best results obtained. Also note that because of the slow convergence of
EPI, the values for the stopping control parameter $K$ are chosen much larger than those
for ERPS.

Figure 4.3: Running time required for PI & ERPS (n = 10, r = 10, based on 30 independent
replications) to find the optimal solutions to MDPs with different numbers of actions,
(a) using log-scale for horizontal axis; (b) using log-log plot.

Figure 4.4: Four typical locally optimal solutions to the test problem.


algorithms                        stop rule (K)    Avg. time (std err)    mean relerr (std err)

EPI                               20               2.13 (0.11)            1.74e-02 (1.35e-03)
(Pm = 0.1, Pg = 0.9, Pl = 0.1)    40               3.80 (0.16)            1.12e-02 (8.81e-04)
                                  80               6.63 (0.34)            7.13e-03 (5.37e-04)
                                  160              16.30 (0.59)           3.22e-03 (2.26e-04)

ERPS                              2                1.03 (0.02)            9.81e-05 (5.17e-05)
(q0 = 0.5, r = 10)                4                1.12 (0.03)            7.12e-05 (4.95e-05)
                                  8                1.28 (0.03)            2.37e-05 (1.64e-05)
                                  16               1.50 (0.03)            1.06e-09 (6.59e-10)
                                  32               1.86 (0.04)            0.00e-00 (0.00e-00)

Table  4.2:  Convergence  results  for  EPI  (n  =  10)  &  ERPS  (n  =  10,  r  =  10)  based  on  30 
independent  replications.  The  standard  errors  are  in  parentheses. 

To see how the exploitation probability $q_0$ affects the performance of ERPS, a set of
experiments is also performed by fixing the stopping control parameter $K = 10$ and varying
$q_0$. The numerical results are recorded in Table 4.3, where $N_{opt}$ indicates the number of
times an optimal solution was found out of 30 trials. The $q_0 = 1.0$ case corresponds to pure
local search. Obviously in this case, the algorithm gets trapped in a local minimum,
which has a mean relative error of 5.62e-3. However, note that the standard error is
zero, which means that the local minimum is estimated with very high precision. This
shows that the “nearest neighbor” heuristic is indeed useful in fine-tuning the solutions.
In contrast, the pure random search ($q_0 = 0$) case is helpful in escaping from the local
minima, yielding a lower mean relative error of 2.59e-5, but it is not very good at locating
the exact optimal solutions, as none was found out of 30 trials. Roughly, increasing $q_0$
between 0 and 0.5 leads to a more accurate estimation of the optimal solution; however,
increasing $q_0$ over the range 0.6 to 1.0 decreases the quality of the solution, because the local
search part begins to gradually dominate, so that the algorithm is more easily trapped in
local minima. This also explains why we have larger variances when $q_0 = 0.6, 0.7, 0.8, 0.9$
in Table 4.3. Notice that the algorithm is very slow in the pure local search case; setting
$q_0 < 1$ speeds up the algorithm substantially.


q0     Avg. time (std err)    Nopt    mean relerr (std err)
0.0    3.30 (0.13)             0      2.59e-05 (6.19e-06)
0.1    1.96 (0.04)             5      4.51e-08 (8.60e-09)
0.2    1.48 (0.03)            12      1.26e-08 (3.47e-09)
0.3    1.39 (0.02)            24      2.74e-09 (2.02e-09)
0.4    1.28 (0.02)            25      2.69e-05 (1.89e-05)
0.5    1.32 (0.03)            27      8.75e-10 (6.01e-10)
0.6    1.41 (0.04)            25      6.19e-05 (3.20e-05)
0.7    1.50 (0.04)            22      1.53e-04 (6.96e-05)
0.8    1.81 (0.04)            15      3.04e-04 (7.09e-05)
0.9    2.33 (0.08)            11      7.99e-04 (1.63e-04)
1.0    7.86 (0.02)             0      5.62e-03 (0.00e-00)

Table  4.3:  Performance  of  ERPS  with  different  exploitation  probabilities  (n  =  10,  K  = 
10,  r  =  10)  based  on  30  independent  replications.  The  standard  errors  are  in  parentheses. 
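
The role of the exploitation probability can be illustrated with a minimal sketch of a single action-selection step. The sketch rests on assumptions that are not spelled out in this section: the action space is taken to be the interval [0, 1], "exploitation" is taken to mean sampling uniformly within a search range r of the elite action, and "exploration" is taken to mean sampling uniformly from the whole action space. The function name `sample_action` and its arguments are hypothetical.

```python
import random

def sample_action(elite_action, q0, r, rng, low=0.0, high=1.0):
    """Draw one candidate action for a single state (illustrative only).

    With probability q0 (exploitation) sample near the elite action,
    otherwise (exploration) sample uniformly from the whole action space.
    """
    if rng.random() < q0:
        lo = max(low, elite_action - r)
        hi = min(high, elite_action + r)
        return rng.uniform(lo, hi)
    return rng.uniform(low, high)

rng = random.Random(0)
# q0 = 1.0: pure local search, every sample stays within r of the elite action.
local_only = [sample_action(0.5, 1.0, 0.01, rng) for _ in range(1000)]
# q0 = 0.0: pure random search, samples spread over the whole action space.
global_only = [sample_action(0.5, 0.0, 0.01, rng) for _ in range(1000)]
```

Under these assumptions, the two extreme settings reproduce the qualitative behavior discussed above: pure local search never leaves the neighborhood of the current elite action, while pure random search ignores it entirely.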

To provide a numerical comparison between the “nearest neighbor” heuristic (biased
sampling) and the policy mutation procedure (unbiased sampling), we refer to the algorithm
that uses the PICS step together with the policy mutation procedure as Algorithm 1. In both
ERPS and Algorithm 1, we fix the population size $n = 10$, and stop the algorithms only when
a desired accuracy is reached. In Table 4.4, we record the length of time required for
different algorithms to reach a relative error of at most 1.0e-6. Indeed, we see that ERPS
uses far less time to reach a required accuracy than Algorithm 1 does.

algorithm        parameters                       Avg. time (std err)    actual relerr (std err)
ERPS (r = 10)    q0 = 0.0                         13.31 (0.60)           7.63e-07 (3.71e-08)
ERPS (r = 10)    q0 = 0.1                          1.20 (0.03)           4.99e-07 (5.47e-08)
ERPS (r = 10)    q0 = 0.3                          0.96 (0.04)           3.26e-07 (4.83e-08)
ERPS (r = 10)    q0 = 0.5                          0.97 (0.03)           3.84e-07 (5.08e-08)
ERPS (r = 10)    q0 = 0.7                          1.61 (0.18)           3.47e-07 (4.91e-08)
ERPS (r = 10)    q0 = 0.9                          4.03 (0.62)           2.33e-07 (4.62e-08)
ALG. 1           Pm = 0.1, Pg = 0.9, Pl = 0.1     62.4 (3.0)             7.61e-07 (3.67e-08)
ALG. 1           Pm = 0.3, Pg = 0.9, Pl = 0.1     33.3 (1.4)             8.42e-07 (2.76e-08)
ALG. 1           Pm = 0.5, Pg = 0.9, Pl = 0.1     26.6 (1.4)             8.35e-07 (2.93e-08)
ALG. 1           Pm = 0.7, Pg = 0.9, Pl = 0.1     22.1 (1.2)             7.88e-07 (3.34e-08)
ALG. 1           Pm = 0.9, Pg = 0.9, Pl = 0.1     20.2 (1.1)             8.44e-07 (2.55e-08)
ALG. 1           Pm = 1.0, Pg = 1.0, Pl = 0.0     17.6 (0.9)             7.67e-07 (4.08e-08)

Table 4.4: Average time required to reach a precision of at least 1.0e-6 for different
algorithms. All results are based on 30 independent replications. The standard errors are in
parentheses.

Continuous  Action  Space 

We test the algorithm when the action space $A$ is continuous, where the service
completion probability can be any value between 0 and 1. Again, two cost functions are
considered, corresponding to cases (i) and (ii) in the discrete action space examples. In
both cases, the maximum queue length $L$, state space $X$, and the discount factor $\alpha$ are
all taken to be the same as before.

In the numerical experiments, we approximated the optimal costs $J_1^*$ and $J_2^*$ for each
of the respective cases (i) and (ii) by two value functions $\hat{J}_1^*$ and $\hat{J}_2^*$, which
were computed by using the adaptive ERPS algorithm under the following parameter settings:
population size $n = 10$; stopping control parameter $K = 10$; exploitation probability
$q_0 = 0.5$; initial search range tolerance $\varepsilon$ = 1e-12 for case (i) and
$\varepsilon$ = 1e-10 for case (ii); $K_1 = 5$; $K_2 = 5$; $K_3 = 5$; $\gamma = 2$. We performed
200 independent runs of the adaptive ERPS algorithm for each case, and $\hat{J}_1^*$
($\hat{J}_2^*$) was obtained as the best solution out of the 200 replications.

We set the population size $n = 10$, termination control parameter $K = 10$, and test
the ERPS algorithm by using different values of the search range $r$. The performance
of the algorithm is also compared with that of a deterministic policy iteration (PI)
algorithm, where we first uniformly discretize the action space into evenly spaced points by
using a mesh size $h$, and then apply the standard PI algorithm on the discretized
problem. Tables 4.5 and 4.6 give the performances of both algorithms for cases (i) and (ii),
respectively. Note that the relative errors are actually computed by replacing the optimal
costs with their corresponding approximations in equation (4.12).

Test results indicate that ERPS outperforms the discretization-based PI algorithm
in both cases, not only in computational time but also in solution quality. We observe
that the computational time for PI increases by a factor of 2 for each halving of the mesh
size, while the time for ERPS increases at a much slower rate.

algorithm             parameters      Avg. time (std err)    mean relerr (std err)
ERPS (r = 1/4000)     q0 = 0.25       2.54 (0.10)            1.92e-12 (3.64e-13)
ERPS (r = 1/4000)     q0 = 0.50       2.27 (0.09)            6.41e-13 (7.07e-14)
ERPS (r = 1/4000)     q0 = 0.75       2.92 (0.08)            1.92e-13 (2.69e-14)
ERPS (r = 1/8000)     q0 = 0.25       2.61 (0.10)            4.66e-13 (6.03e-14)
ERPS (r = 1/8000)     q0 = 0.50       2.91 (0.10)            1.08e-13 (1.59e-14)
ERPS (r = 1/8000)     q0 = 0.75       3.05 (0.11)            6.84e-14 (1.03e-14)
ERPS (r = 1/16000)    q0 = 0.25       2.84 (0.09)            1.33e-13 (2.35e-14)
ERPS (r = 1/16000)    q0 = 0.50       3.25 (0.10)            3.06e-14 (4.56e-15)
ERPS (r = 1/16000)    q0 = 0.75       3.68 (0.10)            1.89e-14 (2.50e-15)
PI                    h = 1/4000      6 (N/A)                7.96e-09 (N/A)
PI                    h = 1/8000      12 (N/A)               1.72e-09 (N/A)
PI                    h = 1/16000     23 (N/A)               4.74e-10 (N/A)
PI                    h = 1/32000     47 (N/A)               9.52e-11 (N/A)
PI                    h = 1/128000    191 (N/A)              6.12e-12 (N/A)
PI                    h = 1/512000    781 (N/A)              3.96e-13 (N/A)

Table 4.5: Comparison of the ERPS algorithm (n = 10, K = 10) with the deterministic
PI algorithm for case (i). The results of ERPS are based on 30 independent replications.
The standard errors are in parentheses.

algorithm             parameters      Avg. time (std err)    mean relerr (std err)
ERPS (r = 1/4000)     q0 = 0.25       2.75 (0.10)            8.49e-11 (1.50e-11)
ERPS (r = 1/4000)     q0 = 0.50       2.91 (0.09)            1.76e-11 (2.90e-12)
ERPS (r = 1/4000)     q0 = 0.75       3.16 (0.09)            8.53e-12 (1.21e-12)
ERPS (r = 1/8000)     q0 = 0.25       3.09 (0.12)            1.70e-11 (2.57e-12)
ERPS (r = 1/8000)     q0 = 0.50       3.00 (0.12)            4.17e-12 (4.94e-13)
ERPS (r = 1/8000)     q0 = 0.75       3.62 (0.08)            1.55e-12 (1.47e-13)
ERPS (r = 1/16000)    q0 = 0.25       3.20 (0.10)            6.08e-12 (1.17e-12)
ERPS (r = 1/16000)    q0 = 0.50       3.28 (0.11)            1.19e-12 (1.40e-13)
ERPS (r = 1/16000)    q0 = 0.75       4.20 (0.12)            4.25e-13 (5.05e-14)
PI                    h = 1/4000      6 (N/A)                2.71e-07 (N/A)
PI                    h = 1/8000      11 (N/A)               5.66e-08 (N/A)
PI                    h = 1/16000     22 (N/A)               1.58e-08 (N/A)
PI                    h = 1/32000     43 (N/A)               5.21e-09 (N/A)
PI                    h = 1/128000    176 (N/A)              3.58e-10 (N/A)
PI                    h = 1/512000    727 (N/A)              1.71e-11 (N/A)

Table 4.6: Comparison of the ERPS algorithm (n = 10, K = 10) with the deterministic
PI algorithm for case (ii). The results of ERPS are based on 30 independent replications.
The standard errors are in parentheses.

4.6.2  A Two-Dimensional Queueing Example

The second example, shown in Figure 4.5, is a slight modification of the first one,
the difference being that we now have a single queue that feeds two independent
servers with different service completion probabilities $a_1$ and $a_2$. We consider only the
continuous action space case. The action to be chosen at each state $x$ is $(a_1, a_2)^T$, which
takes values in the set $A = [0, 1] \times [0, 1]$. We assume that an arrival that finds the system
empty will always be served by the server with service completion probability $a_1$. The
state space of this problem is $X = \{0, 1_{S_1}, 1_{S_2}, 2, \ldots, 48\}$, where we have assumed
that the maximum queue length (not including those in service) is 46, and $1_{S_1}$, $1_{S_2}$ are
used to distinguish whether server 1 or server 2 is busy when there is only one
customer in the system. As before, the discount factor is $\alpha = 0.98$.

The one-stage cost is taken to be

\[
R(y, a_1, a_2) = y + \left[ \frac{|A|}{2} \cos(\pi a_1) - y \right]^2 I_{\{S_1\}}
               + \left[ \frac{|A|}{2} \sin(\pi a_2) - y \right]^2 I_{\{S_2\}},
\]

where

\[
I_{\{S_i\}} = \begin{cases} 1 & \text{if server } i \text{ is busy,} \\ 0 & \text{otherwise,} \end{cases}
\quad (i = 1, 2), \qquad \text{and} \qquad
y = \begin{cases} 1 & \text{if } x \in \{1_{S_1}, 1_{S_2}\}, \\ x & \text{otherwise.} \end{cases}
\]

[Diagram not recoverable; surviving labels: p = 0.2, departure.]

Figure 4.5: A two-dimensional queueing example.

Again, in computing the relative error, we approximated $J^*$ by $\hat{J}^*$, which was
computed by using the adaptive ERPS algorithm under the same settings (e.g., parameter
settings, number of replications) as in case (ii) of the discrete action space examples.
The value of $\|\hat{J}^*\|_\infty$ is approximately 1.72e+04.

The  performances  of  the  ERPS  and  the  discretization-based  PI  are  reported  in 
Table  4.7.  In  ERPS,  both  the  population  size  n  and  the  stopping  control  parameter  K  are 
set  to  10.  In  PI,  we  adopt  a  uniform  discretization,  where  the  same  mesh  size  h  is  used  in 
both  directions  of  the  action  space.  Notice  that  the  computational  time  for  PI  increases 
by  a  factor  of  4  for  each  halving  of  the  mesh  size,  whereas  the  time  required  by  ERPS 
increases  much  more  slowly. 
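
One way to read the factor-of-4 observation is through the size of the discretized action set. The sketch below assumes that the work done by PI scales with the number of grid points and that the uniform grid includes both endpoints of [0, 1] in each direction; neither assumption is stated in the text, and the helper `grid_size` is hypothetical.

```python
def grid_size(h, dim):
    """Number of actions on a uniform grid with mesh size h over [0, 1]^dim."""
    points_per_axis = round(1.0 / h) + 1
    return points_per_axis ** dim

for dim in (1, 2):
    coarse = grid_size(1 / 100, dim)
    fine = grid_size(1 / 200, dim)
    # Halving h roughly doubles the grid in 1-D and quadruples it in 2-D.
    print(dim, coarse, fine, fine / coarse)
```

Under these assumptions, halving the mesh size multiplies the number of candidate actions by about 2 in the one-dimensional example and by about 4 in the two-dimensional one, which matches the growth in PI running time reported for the two examples.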

algorithm                       parameters    Avg. time (std err)    mean relerr (std err)
ERPS (r illegible in source)    q0 = 0.25     3.26 (0.14)            2.60e-06 (1.36e-07)
ERPS (r illegible in source)    q0 = 0.50     3.20 (0.15)            1.06e-05 (9.17e-06)
ERPS (r illegible in source)    q0 = 0.75     3.64 (0.14)            8.98e-05 (2.54e-05)

ERPS (r illegible in source)    q0 = 0.25     3.37 (0.12)            6.67e-07 (3.59e-08)
ERPS (r illegible in source)    q0 = 0.50     3.28 (0.12)            9.58e-06 (9.20e-06)
ERPS (r illegible in source)    q0 = 0.75     3.89 (0.17)            9.38e-05 (2.47e-05)

ERPS (r = 1/400)                q0 = 0.25     3.78 (0.11)            1.50e-07 (8.30e-09)
ERPS (r = 1/400)                q0 = 0.50     3.85 (0.12)            9.30e-06 (9.21e-06)
ERPS (r = 1/400)                q0 = 0.75     4.45 (0.14)            4.59e-05 (1.90e-05)

PI                              h = 1/100     15 (N/A)               1.65e-04 (N/A)
PI                              h = 1/200     57 (N/A)               4.30e-05 (N/A)
PI                              h = 1/400     226 (N/A)              8.87e-06 (N/A)

Table 4.7: A two-dimensional test example. The results of ERPS are based on 30
independent replications (n = 10, K = 10).

In Table 4.8, we compare the performance of the adaptive ERPS algorithm and the
original ERPS algorithm in obtaining high quality solutions. In both algorithms, we choose
the population size $n = 10$, the stopping control parameter $K = 10$, and the exploitation
probability $q_0 = 0.5$. In adaptive ERPS, the initial search range is $r = 0.1$, $\gamma = 2$, the
parameters $K_1$, $K_2$ and $K_3$ are all set to 5, and the improvements in elite policies are
evaluated in the infinity-norm. We see that in order to obtain more and more accurate
solutions, the search range in ERPS has to be chosen excessively small, which causes a
significant increase in computational effort. In contrast, the adaptive ERPS achieves
better solutions in less time; moreover, the algorithm provides us with a rough estimate of
the solution quality: as mentioned in Chapter 4.5, the average difference between the
resultant value function $J$ and the optimal cost $J^*$ (i.e., $\|J - J^*\|_\infty$) will be of the
same order of magnitude as $\varepsilon$; and the relative error can also be estimated as:


\[
\text{relerr} = \frac{\|J - J^*\|_\infty}{\|J^*\|_\infty} \approx \frac{\varepsilon}{\|J^*\|_\infty}.
\]

algorithm        parameters     Avg. time     mean relerr (stderr)    ||J - J*||_inf (stderr)
ERPS             r = 1/20000    16.4 (0.2)    2.25e-11 (8.88e-13)     N/A (N/A)
ERPS             r = 1/40000    24.8 (0.3)    5.04e-12 (1.95e-13)     N/A (N/A)
ERPS             r = 1/80000    39.1 (0.5)    1.02e-12 (7.18e-14)     N/A (N/A)
Adaptive ERPS    e = 1e-07      13.8 (0.7)    9.28e-12 (3.22e-12)     1.59e-07 (5.54e-08)
Adaptive ERPS    e = 1e-08      15.7 (0.8)    3.95e-13 (1.67e-13)     6.80e-09 (2.87e-09)
Adaptive ERPS    e = 1e-09      17.1 (0.7)    1.09e-13 (3.12e-14)     1.87e-09 (5.37e-10)

Table 4.8: Comparison of ERPS (n = 10, K = 10, q0 = 0.5) with adaptive ERPS
(n = 10, K = 10, q0 = 0.5, r = 0.1, K1 = K2 = K3 = 5, gamma = 2), based on 30
independent replications. Here e denotes the tolerance $\varepsilon$.
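
As a quick consistency check on Table 4.8, the short script below divides the reported mean infinity-norm errors of adaptive ERPS by the approximate norm of the optimal cost (about 1.72e+04, the value quoted earlier for this example) and compares the result with the reported mean relative errors. It assumes that the relative error is the infinity-norm error divided by that norm; the variable names are illustrative only.

```python
J_STAR_NORM = 1.72e4  # approximate infinity-norm of the optimal cost quoted in the text

# (tolerance, reported mean ||J - J*||_inf, reported mean relerr) from Table 4.8
rows = [
    (1e-7, 1.59e-07, 9.28e-12),
    (1e-8, 6.80e-09, 3.95e-13),
    (1e-9, 1.87e-09, 1.09e-13),
]

for eps, abs_err, reported in rows:
    implied = abs_err / J_STAR_NORM   # relerr implied by the absolute error
    a_priori = eps / J_STAR_NORM      # rough estimate available before the run
    print(f"eps={eps:.0e} implied={implied:.2e} "
          f"reported={reported:.2e} a_priori={a_priori:.2e}")
```

The implied and reported relative errors agree to within rounding of the tabulated values, and the tolerance-based estimate is of the same order of magnitude as the achieved relative error.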

4.7  Conclusions  and  Open  Problems 

We presented an evolutionary, population-based method called ERPS for solving
infinite horizon discounted cost MDP problems. We showed that the algorithm converges to
an optimal policy w.p.1. We also illustrated the algorithm by applying it to two controlled
queueing examples with large or uncountable action spaces. Numerical experiments on
these small examples indicate that the ERPS algorithm is a promising approach,
outperforming some existing methods (including the standard policy iteration algorithm).

Many challenges remain to be addressed before the algorithm can be applied to
realistic-sized problems. The motivation behind ERPS is the setting where the action space
is extremely large so that enumerating the entire action space becomes computationally
impractical; however, the approach still requires enumerating the entire state space. To
make it applicable to large state space problems, the algorithm will probably need to
be used in conjunction with some other state space reduction techniques such as state
aggregation or value function approximation. This avenue of investigation clearly merits
further research.


Another important issue is the dependence of ERPS on the underlying distance
metric, as determining a good metric could be challenging for those problems that do
not have a natural metric already available. One possible way to get around this is to
adaptively update the action selection distribution $\mathcal{P}$ at each iteration of the
algorithm based on the sampling information obtained during the previous iterations. This
actually constitutes a learning process; the hope is that more promising actions will have
larger chances of being selected so that the future search will be biased toward the region
containing high quality solutions (policies).

Another practical issue is the choice of the exploitation probability $q_0$. As noted
earlier, the parameter $q_0$ serves as a tradeoff between exploitation and exploration in
action selections. Preliminary experimental results indicate some robustness with respect
to the value of this parameter, in that values between 0.25 and 0.75 all seem to work well;
however, this may not hold for larger problems or other settings, so further investigation is
required. One approach is to design a strategy similar to that used in simulated annealing
algorithms and study the behavior of the algorithm when the value of $q_0$ is gradually
increased from 0 to 1, which corresponds to a transition of the search mechanism from pure
random sampling to pure local search.

Chapter 5

A Model Reference Adaptive Search Method for Global Optimization

5.1  Introduction and Motivation

The focus of this chapter is on the development of a new randomized search framework
we call model reference adaptive search (MRAS) for solving both continuous and
combinatorial (deterministic) global optimization problems. Similar to what has been
done in the field of machine learning and the work of [91], we characterize the existing
general purpose global optimization techniques as being either instance-based or
model-based; please refer to Chapter 1.2 for a review. Over the past few decades, a
significant amount of research effort has been centered around classical instance-based
methods. Thus, the behavior of these methods is relatively well understood. However, the
model-based methods are still merely a collection of independently developed heuristic
methods, without concrete theoretical foundations. The main contribution of this research
is to provide a new unifying framework that addresses the most common computational
difficulties faced by many model-based methods and to propose a simple way of constructing
a class of model-based optimization algorithms with theoretical performance guarantees.

A schematic description of the model-based search method is given in Figure 5.1. In
model-based methods, there is often an intermediate probabilistic model over the solution
space, and at each iteration of these approaches new solutions are sampled/generated
from the current probabilistic model; the performance of these candidate solutions is
then evaluated and used to update the current model according to some pre-specified
updating mechanism.

Figure 5.1: A description of the model-based methods

As we can see, there are two key questions we need to address in model-based search
methods. The first question is, of course, how to update the probabilistic model. For
example, traditional estimation of distribution algorithms (EDAs) (Chapter 1.2, Chapter 2.2)
use an explicit construction procedure, and try to build an empirical distribution over the
solution space. The updating of these empirical distributions is then usually carried out
at each iteration either via measuring sample frequencies or by using the maximum
likelihood estimation technique. However, the difficulty is that these empirical
distributions need to be tailored to specific problems. For more complex problems, it is
often tempting to use more complicated models to improve the performance of these methods,
but the model construction and updating could be computationally expensive. Moreover, for
“black-box” problems, where little or nothing is known about the structure of the
underlying problem, how to choose the most appropriate empirical model is a difficult
issue. The second question, at the other extreme, is how to sample: oftentimes one may
have a nice sequence of probabilistic models, but sampling from these distributions is
itself a major difficulty. For instance, as we have discussed in Chapter 2.2, in annealing
adaptive search (AAS), the majority of the computational effort is not spent in updating
Boltzmann distributions, but in efficiently generating samples/candidate solutions
from these distributions. These fundamental issues in model-based search methods are the
motivation behind the MRAS method.

5.2  The  Model  Reference  Adaptive  Search  Method 

The Model Reference Adaptive Search method tries to address the aforementioned
difficulties in the following way. A high-level description of the framework is shown in
Figure 5.2, where we split the components of MRAS into two groups. The components in the
red dashed box in the figure address the issue of how to sample, whereas the components
in the blue box are responsible for the issue of how to update distributions. In MRAS,
instead of using arbitrary (empirical) distributions (as in EDAs), we use a family of
parameterized distributions as sampling distributions to generate candidate solutions. The
hope is that this parameterized family is specified with some structure so that once the
parameter is determined, sampling from each of these distributions should be a relatively
easy task. An additional advantage of using the parameterized family is that the task of
updating (empirical) sampling distributions now simplifies to the task of updating the
parameters associated with the distribution family. At each iteration of MRAS, the parameter
is determined by minimizing a certain distance between the parameterized family and an
additional sequence of distributions we call reference distributions. These reference
distributions are primarily used to guide the parameter updating process and to express the
desired properties of the framework. Thus, to ensure the convergence of the framework,
we often want to construct these distributions so that they will converge to a degenerate
distribution concentrated only on the optimum. Intuitively, among the parameterized
family, the current sampling distribution can be viewed as a compact approximation of
the reference distribution (the projection of the reference distribution on the
parameterized family), and may hopefully retain some nice properties of these distributions.
Thus, as the sequence of reference distributions converges, the sequence of samples generated
from their compact approximations (i.e., sampling distributions) should also converge to
the optimum. Since this idea is very similar to the use of reference models in adaptive
control, we call this method model reference adaptive search.

Figure 5.2: A schematic description of the MRAS framework

5.3  The MRAS$_0$ Algorithm (Exact Version)

We consider the optimization problem introduced in Chapter 2.2:

\[
x^* \in \arg\max_{x \in \mathcal{X}} H(x), \quad x \in \mathcal{X} \subseteq \mathbb{R}^n, \tag{5.1}
\]

where the solution space $\mathcal{X}$ is a non-empty set in $\mathbb{R}^n$, and
$H(\cdot) : \mathcal{X} \to \mathbb{R}$ is a deterministic function that is bounded from below,
i.e., $\exists M > -\infty$ such that $H(x) \ge M$ $\forall x \in \mathcal{X}$.
We will not impose any further structural (continuity, differentiability) assumptions on
$H(\cdot)$. Thus, in our setting, we are interested in general optimization problems with
little structure, or in cases where $H(\cdot)$ does have some structure that is not
known a priori. We assume that the global optimal solution to (5.1) exists and is
unique, i.e., $\exists x^* \in \mathcal{X}$ such that $H(x) < H(x^*)$ $\forall x \ne x^*$,
$x \in \mathcal{X}$; however, we note that the problem may have many locally optimal solutions.

MRAS works with a family of parameterized distributions $\{f(\cdot, \theta), \theta \in \Theta\}$,
where $\Theta$ is the parameter space. The parameter updating in MRAS is determined by a
sequence of reference distributions $\{g_k(\cdot)\}$. In particular, at each iteration $k$, we
look at the projection of $g_k(\cdot)$ on the family of distributions
$\{f(\cdot, \theta), \theta \in \Theta\}$ and compute the new parameter vector $\theta_{k+1}$ that
minimizes the Kullback-Leibler (KL) divergence

\[
\mathcal{D}(g_k, f(\cdot, \theta)) := E_{g_k}\left[ \ln \frac{g_k(X)}{f(X, \theta)} \right]
  = \int_{\mathcal{X}} \ln \frac{g_k(x)}{f(x, \theta)} \, g_k(x) \, \nu(dx),
\]

where $\nu$ is the Lebesgue/counting measure defined on $\mathcal{X}$, $X = (X_1, \ldots, X_n)$ is a
random vector taking values in $\mathcal{X}$, and $E_{g_k}[\cdot]$ denotes the expectation taken
with respect to $g_k(\cdot)$.

Intuitively speaking, $f(\cdot, \theta_{k+1})$ can be viewed as a compact representation of the
reference distribution $g_k(\cdot)$; consequently, the feasibility and effectiveness of the
algorithm will, to a large extent, depend on the choices of the reference distributions.
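
To make the projection step concrete, the KL divergence can be split into a term that does not depend on $\theta$ and a cross-entropy term. This is a standard identity rather than a result specific to this chapter; it is written here in the chapter's notation.

```latex
\begin{align*}
\mathcal{D}(g_k, f(\cdot,\theta))
  &= E_{g_k}\!\left[\ln \frac{g_k(X)}{f(X,\theta)}\right] \\
  &= E_{g_k}\left[\ln g_k(X)\right] - E_{g_k}\left[\ln f(X,\theta)\right],
\end{align*}
so that, since the first term is free of $\theta$,
\[
\arg\min_{\theta \in \Theta} \mathcal{D}(g_k, f(\cdot,\theta))
  = \arg\max_{\theta \in \Theta} E_{g_k}\left[\ln f(X,\theta)\right].
\]
```

In other words, projecting a reference distribution onto the parameterized family amounts to maximizing the expected log-likelihood of the family under that reference distribution.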

As we can see from Chapter 5.2, there is a lot of flexibility in the choice of reference
distributions, so we can construct different instantiations of the framework by selecting
different sequences of reference distributions. We now analyze a particular instantiation
of the framework, which we call MRAS$_0$, by explicitly specifying a simple iterative scheme
for constructing the sequence of reference distributions.

Let $g_0(x) > 0$ $\forall x \in \mathcal{X}$ be an initial probability density/mass function
(p.d.f./p.m.f.) on the solution space $\mathcal{X}$. At each iteration $k \ge 1$, we compute a new
p.d.f./p.m.f. by tilting the old p.d.f./p.m.f. $g_{k-1}(x)$ with the performance function
$H(x)$ (for simplicity, here we assume $H(x) > 0$ $\forall x \in \mathcal{X}$), i.e.,

\[
g_k(x) = \frac{H(x)\, g_{k-1}(x)}{\int_{\mathcal{X}} H(x)\, g_{k-1}(x)\, \nu(dx)},
\quad \forall x \in \mathcal{X}. \tag{5.2}
\]

By doing so, we are assigning more weight to solutions that have better performance. One
direct consequence of this is that each iteration of (5.2) improves the expected performance.
To be precise,

\[
E_{g_k}[H(X)] = \frac{E_{g_{k-1}}\left[(H(X))^2\right]}{E_{g_{k-1}}[H(X)]} \ge E_{g_{k-1}}[H(X)].
\]

Furthermore, it is possible to show that the sequence $\{g_k(\cdot), k = 0, 1, \ldots\}$ will
converge to a distribution that concentrates only on the optimal solution for arbitrary
$g_0(\cdot)$. So we will have $\lim_{k \to \infty} E_{g_k}[H(X)] = H(x^*)$. The above idea has
previously been used, for example, in EDAs with proportional selection schemes (cf. e.g.,
[90]), and in randomized algorithms for solving Markov decision processes ([20]). However,
in those approaches, the construction of $g_k(\cdot)$ in (5.2) needs to be carried out
explicitly to generate new samples; moreover, since $g_k(\cdot)$ may not have any structure,
sampling from it could be computationally expensive. In MRAS, these difficulties are
circumvented by projecting $g_k(\cdot)$ on the family of parameterized distributions
$\{f(\cdot, \theta)\}$. On the one hand, $f(\cdot, \theta_k)$ often has some special structure and
therefore could be much easier to handle, and on the other hand, the sequence
$\{f(\cdot, \theta_{k+1}), k = 0, 1, \ldots\}$ may retain some nice properties of $\{g_k(\cdot)\}$ and
also converge to a degenerate distribution concentrated on the optimal solution.
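
The tilting recursion (5.2) is easy to trace on a small finite solution space. The sketch below is only an illustration: the five performance values, the uniform starting p.m.f., and the helper names `tilt` and `expected` are all made up for the example. It records the expected performance after each tilting step.

```python
def tilt(g, H):
    """One step of (5.2) on a finite space: new mass proportional to H(x) * g(x)."""
    weights = [h * p for h, p in zip(H, g)]
    total = sum(weights)
    return [w / total for w in weights]

def expected(g, H):
    """Expected performance E_g[H(X)] under the p.m.f. g."""
    return sum(h * p for h, p in zip(H, g))

H = [1.0, 3.0, 2.0, 5.0, 4.0]   # positive performance values, unique maximum at index 3
g = [1.0 / len(H)] * len(H)     # g_0: uniform p.m.f., strictly positive everywhere
means = [expected(g, H)]
for k in range(1, 201):
    g = tilt(g, H)
    means.append(expected(g, H))
```

In this toy run the recorded means never decrease and approach the maximal value of H, while the p.m.f. piles up on the maximizer, in line with the monotonicity and convergence statements above.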


5.3.1  Algorithm  Description 

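Throughout the analysis, we use $P_{\theta_k}(\cdot)$ and $E_{\theta_k}[\cdot]$ to denote the
probability and expectation taken with respect to the p.d.f./p.m.f. $f(\cdot, \theta_k)$, and
$I_{\{\cdot\}}$ to denote the indicator function, i.e.,

\[
I_{\{A\}} := \begin{cases} 1 & \text{if event } A \text{ holds,} \\ 0 & \text{otherwise.} \end{cases}
\]

Thus, under our notational convention,

\[
P_{\theta_k}(H(X) \ge \gamma) = \int_{\mathcal{X}} I_{\{H(x) \ge \gamma\}} f(x, \theta_k)\, \nu(dx)
\quad \text{and} \quad
E_{\theta_k}[H(X)] = \int_{\mathcal{X}} H(x) f(x, \theta_k)\, \nu(dx).
\]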


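Algorithm MRAS$_0$ (exact version)

•  Initialization: Specify $\rho \in (0, 1]$, a small number $\varepsilon \ge 0$, a strictly
   increasing function $S(\cdot) : \mathbb{R} \to \mathbb{R}^+$, and an initial p.d.f./p.m.f.
   $f(x, \theta_0) > 0$ $\forall x \in \mathcal{X}$. Set the iteration counter $k = 0$.

•  Repeat until a specified stopping rule is satisfied:

   1.  Calculate the $(1 - \rho)$-quantile

       \[ \gamma_{k+1} := \sup_{l} \left\{ l : P_{\theta_k}(H(X) \ge l) \ge \rho \right\}. \]

   2.  if $k = 0$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$.
       elseif $k \ge 1$
           if $\gamma_{k+1} \ge \bar{\gamma}_k + \varepsilon$, then set $\bar{\gamma}_{k+1} = \gamma_{k+1}$.
           else set $\bar{\gamma}_{k+1} = \bar{\gamma}_k$.
           endif
       endif

   3.  Compute the parameter vector $\theta_{k+1}$ as

       \[
       \theta_{k+1} := \arg\max_{\theta \in \Theta} E_{\theta_k}\left[
         \frac{[S(H(X))]^k}{f(X, \theta_k)} I_{\{H(X) \ge \bar{\gamma}_{k+1}\}} \ln f(X, \theta) \right]. \tag{5.3}
       \]

   4.  Set $k = k + 1$.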


The MRAS$_0$ algorithm requires specification of a parameter $\rho$, which determines
the approximate proportion of samples that will be used to update the probabilistic
model. At successive iterations of the algorithm, a sequence $\{\gamma_k, k = 1, 2, \ldots\}$,
i.e., the $(1 - \rho)$-quantiles with respect to the sequence of p.d.f.'s
$\{f(\cdot, \theta_k)\}$, is calculated at step 1 of MRAS$_0$. These quantile values are then
used in step 2 to construct a sequence of non-decreasing thresholds
$\{\bar{\gamma}_k, k = 1, 2, \ldots\}$; and only those candidate solutions that have
performances better than these thresholds will be used in parameter updating (cf.
equation (5.3)). As we will see, the theoretical convergence of MRAS$_0$ is unaffected by
the value of the parameter $\rho$. The purpose of $\rho$ in our approach is to concentrate the
computational effort on the set of elite/promising samples, which is a standard technique
employed in most population-based approaches, like GAs and EDAs.

During the initialization step of MRAS$_0$, a small number $\varepsilon$ and a strictly
increasing function $S(\cdot) : \mathbb{R} \to \mathbb{R}^+$ are also specified. The function
$S(\cdot)$ is used to preserve the correct performance order among candidate solutions and to
account for the cases where the values of $H(x)$ are negative for some $x$, and the parameter
$\varepsilon$ ensures that each strict increment in the sequence $\{\bar{\gamma}_k\}$ is lower
bounded, i.e.,

\[
\inf_{k = 1, 2, \ldots :\ \bar{\gamma}_{k+1} \ne \bar{\gamma}_k} (\bar{\gamma}_{k+1} - \bar{\gamma}_k) \ge \varepsilon.
\]

We require $\varepsilon$ to be strictly positive for continuous problems, and non-negative for
discrete problems.
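
The steps of the exact algorithm can be sketched on a solution space small enough to enumerate. Everything specific in the sketch below is an illustrative assumption rather than part of the text: the solution space is {0,1}^n, the sampling family consists of independent Bernoulli distributions with parameters p_i, S is the exponential function, and the names `mras0_bernoulli`, `target` and `performance` are hypothetical. For this family the maximizer in step 3 is obtained by matching means, and the sketch works with the integral form of step 3, in which the factor 1/f(x, theta_k) has already cancelled against f(x, theta_k).

```python
import itertools
import math

def mras0_bernoulli(H, n, rho=0.2, eps=0.0, iters=30, S=math.exp):
    """Exact MRAS_0 sketch on {0,1}^n with independent Bernoulli sampling distributions."""
    points = list(itertools.product((0, 1), repeat=n))
    values = [H(x) for x in points]
    p = [0.5] * n              # theta_0: every point has positive probability
    gamma_bar = None
    for k in range(iters):
        # f(x, theta_k) for every point of the solution space.
        f = [math.prod(pi if xi else 1.0 - pi for pi, xi in zip(p, x)) for x in points]
        # Step 1: gamma_{k+1} = sup{l : P_{theta_k}(H(X) >= l) >= rho}.
        gamma, tail = min(values), 0.0
        for v, fx in sorted(zip(values, f), reverse=True):
            tail += fx
            if tail >= rho:
                gamma = v
                break
        # Step 2: keep the thresholds non-decreasing.
        if gamma_bar is None or gamma >= gamma_bar + eps:
            gamma_bar = gamma
        # Step 3 (integral form): weights S(H(x))^k * I{H(x) >= gamma_bar}.
        w = [S(v) ** k if v >= gamma_bar else 0.0 for v in values]
        total = sum(w)
        # For independent Bernoullis the maximizer matches means: p_i = E_g[X_i].
        p = [sum(wx * x[i] for wx, x in zip(w, points)) / total for i in range(n)]
    return p, gamma_bar

target = (1, 0, 1, 1, 0, 1)

def performance(x):
    """Toy objective: negative Hamming distance to a hidden target (unique maximum)."""
    return -sum(a != b for a, b in zip(x, target))

p, gamma_bar = mras0_bernoulli(performance, n=len(target))
```

On this toy problem the Bernoulli parameters move toward the target bit pattern and the threshold rises to the optimal value. The enumeration is only feasible because the space has 2^6 points; the sketch is meant to show how the quantile, threshold and projection steps fit together, not to suggest a practical implementation.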

In continuous domains, the division by $f(x, \theta_k)$ in the performance function in step
3 is well defined if $f(x, \theta_k)$ has infinite support (e.g., normal p.d.f.), whereas in
discrete/combinatorial domains, the division is still valid as long as each point $x$ in the
solution space has a positive probability of being sampled. Additional regularity conditions
on $f(x, \theta_k)$ in Section 5.5 will ensure that step 3 of MRAS$_0$ can be used interchangeably
with the following equation:

\[
\theta_{k+1} = \arg\max_{\theta \in \Theta} \int_{x \in \mathcal{X}} [S(H(x))]^k
  I_{\{H(x) \ge \bar{\gamma}_{k+1}\}} \ln f(x, \theta)\, \nu(dx).
\]

We  now  show  that  there  is  a  sequence  of  reference  models  {gk{'),  A:  =  1,2,...}  im¬ 
plicit  in  MRASo,  and  the  parameter  computed  at  step  3  indeed  minimizes  the  KL- 
divergence  'D{gk+i,  /(•,  6»)). 


Lemma  5.3.1  The  parameter  9k+i  eomputed  at  the  kth  iteration  of  the  MRASq  algorithm 
minimizes  the  KL-divergenee  V  {g^+i,  f{-,6)),  where 

gk+i{x)  :=  k  =  l,...,  andgi(x)  := 

^gk  [^{^{^))HHiX)>yk+l}\ 

Proof:  For  brevity,  define  Sk{H{x))  :=  ■  We  have 

T r  rr/„\\-  i 

gi{x)  = 


'H{x)>'yi} 

[  f{X,9o)  \ 

h 

[Hix)>'ri} 

^{H{x)>'fi} 

^00 

®iy 

^eo 

'So{H{X))I^H^x)>^^y 

When  A:  >  1,  we  have  from  the  definition  of  gk{-)  above, 

S{H{x))I{H{x)>-f2}9iix) 


92{x)  = 


[S{H{X))I^Hix)>^2}\ 

iS  ( iiA  ( X ) ) /| /^  (^  ^  >  ^2 } -A|  ( X )  >  71 } 


Ee^  pi(l/(X))I|j7(x)>72}-A{H(x)>7i} 

S{H{x))I^}{(^x)>^2} 


Eg, 


S,{H{X))I{Hix)>^2} 


where  the  last  equality  follows  from  the  fact  that  the  sequence  {7^,,  A;  =  1,  2, . . .}  is  non¬ 
decreasing.  Proceeding  iteratively,  it  is  easy  to  see  that 

.  1  [smx)ti{Hix)>,,^,}  ^ _ 

gk+i{x)  = - ^ i V  A:  =  0, 1, . 


Eg 


Sk{H{X))I^Hix)>,,^.} 


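Thus, the KL-divergence between $g_{k+1}(\cdot)$ and $f(\cdot,\theta)$ can be written as

$$\begin{aligned}
\mathcal{D}(g_{k+1}, f(\cdot,\theta)) &= E_{g_{k+1}}\left[\ln g_{k+1}(X)\right] - E_{g_{k+1}}\left[\ln f(X,\theta)\right] \\
&= E_{g_{k+1}}\left[\ln g_{k+1}(X)\right] - \frac{E_{\theta_k}\left[\hat S_k(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}} \ln f(X,\theta)\right]}{E_{\theta_k}\left[\hat S_k(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]}, \quad \forall\, k.
\end{aligned}$$

The result follows by observing that minimizing $\mathcal{D}(g_{k+1}, f(\cdot,\theta))$ with respect to
$\theta$ is equivalent to maximizing the quantity
$E_{\theta_k}\left[\hat S_k(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}} \ln f(X,\theta)\right]$.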
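
The closed form of $g_{k+1}$ derived in the proof can be evaluated directly on a finite solution space. The following Python sketch is illustrative only: it assumes the counting measure, takes $S(\cdot) = \exp(\cdot)$, and uses a fixed threshold; the function name `reference_model` and the performance values are hypothetical.

```python
import math

def reference_model(h_values, k, threshold, S=math.exp):
    """Return g_{k+1}(x), proportional to S(H(x))**k * I{H(x) >= threshold},
    over a finite solution space (counting measure assumed)."""
    weights = [S(h) ** k if h >= threshold else 0.0 for h in h_values]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no point satisfies H(x) >= threshold")
    return [w / total for w in weights]

# Hypothetical performance values H(x) for three candidate solutions.
h_values = [0.0, 1.0, 2.0]
g_1 = reference_model(h_values, k=0, threshold=1.0)    # uniform on {H >= 1}
g_51 = reference_model(h_values, k=50, threshold=1.0)  # mass moves toward the maximizer
```

As $k$ grows, the weight $[S(H(x))]^k$ increasingly favors the point with the largest performance value, which is the mechanism the convergence analysis below exploits.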


5.3.2  Global  Convergence 

The convergence of the MRAS$_0$ algorithm clearly cannot be guaranteed for an
arbitrary parameterized distribution family. For example, if the parameterized family is
a singleton set (i.e., contains only one distribution), then there is in general no way to
ensure the convergence of the algorithm. Another practical concern is that for an arbitrary
parameterized family, the computation of the new parameter $\theta_{k+1}$ in (5.3) may not even
be tractable. These observations suggest that we should restrict our analysis and discussion to
families of distributions that exhibit some structural properties. We now show that for a particular
parameterized family called the natural exponential family (NEF), the global convergence
of the algorithm can be established and the new parameter $\theta_{k+1}$ can actually be obtained
analytically. We start by stating the definition of NEF and some regularity conditions.

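Definition 5.3.1 A parameterized family of p.d.f.'s/p.m.f.'s $\{f(\cdot,\theta),\ \theta\in\Theta\subseteq\Re^m\}$ on $\mathcal{X}$
is said to belong to the natural exponential family (NEF) if there exist functions
$h(\cdot): \Re^n \to \Re$, $\Gamma(\cdot): \Re^n \to \Re^m$, and $K(\cdot): \Re^m \to \Re$ such that

$$f(x,\theta) = \exp\left\{\theta^T\Gamma(x) - K(\theta)\right\} h(x), \quad \forall\,\theta\in\Theta, \tag{5.4}$$

where $K(\theta)$ is a normalization constant, given by $K(\theta) = \ln\int_{x\in\mathcal{X}} \exp\left\{\theta^T\Gamma(x)\right\} h(x)\,\nu(dx)$,
and the superscript "T" denotes vector transposition. For the case where $f(\cdot,\theta)$ is a
p.d.f., we assume that $\Gamma(\cdot)$ is a continuous mapping.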
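
As a concrete instance of Definition 5.3.1, the Bernoulli p.m.f. with success probability $p$ can be written in the form (5.4) with $\Gamma(x) = x$, $h(x) = 1$, natural parameter $\theta = \ln(p/(1-p))$, and $K(\theta) = \ln(1 + e^{\theta})$. The short Python check below is a sketch under these standard choices; the function name is hypothetical.

```python
import math

def bernoulli_nef_pmf(x, theta):
    """Bernoulli p.m.f. in NEF form exp(theta * Gamma(x) - K(theta)) * h(x),
    with Gamma(x) = x, h(x) = 1, and K(theta) = ln(1 + exp(theta))."""
    K = math.log1p(math.exp(theta))
    return math.exp(theta * x - K)

p = 0.3
theta = math.log(p / (1.0 - p))  # natural parameter (log-odds) for success probability p
prob_one = bernoulli_nef_pmf(1, theta)   # should recover p
prob_zero = bernoulli_nef_pmf(0, theta)  # should recover 1 - p
```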

The NEF covers a broad class of distributions such as the Gaussian, exponential, Poisson,
binomial, geometric, and certain multivariate forms of them.

Assumptions:

A1. For any given constant $\xi < H(x^*)$, the set $\{x : H(x) \ge \xi\} \cap \mathcal{X}$ has a strictly positive
Lebesgue or discrete measure.

A2. For any given constant $\delta > 0$, $\sup_{x\in A_\delta} H(x) < H(x^*)$, where $A_\delta := \{x : \|x - x^*\| \ge \delta\} \cap \mathcal{X}$,
and we use the convention that the supremum over the empty set is $-\infty$.

A3. There exists a compact set $\Pi$ such that the level set $\{x : H(x) \ge \bar\gamma_1\} \cap \mathcal{X} \subseteq \Pi$, where
$\bar\gamma_1 = \sup_l\{l : P_{\theta_0}(H(X) \ge l) \ge \rho\}$ is defined as in the MRAS$_0$ algorithm.

A4. The maximizer of equation (5.3) is an interior point of $\Theta$ for all $k$.

A5. $\sup_{\theta\in\Theta} \|\exp\{\theta^T\Gamma(x)\}\Gamma(x)h(x)\|$ is integrable/summable with respect to $x$, where $\theta$,
$\Gamma(\cdot)$, and $h(\cdot)$ are defined as in Definition 5.3.1.

Intuitively, A1 ensures that any neighborhood of the optimal solution $x^*$ will have
a positive probability of being sampled. For ease of exposition, A1 restricts the class
of problems under consideration to either continuous or discrete problems; however, we
remark that this work can be easily extended to problems with a mixture of both continuous
and discrete variables. Since $H(\cdot)$ has a unique global optimizer, A2 is satisfied by many
functions encountered in practice. Note that both A1 and A2 hold trivially when $\mathcal{X}$ is
(discrete) finite and the counting measure is used. Assumption A3 restricts the search of
the MRAS$_0$ algorithm to some compact set; it is satisfied if the function $H(\cdot)$ has compact
level sets or the solution space $\mathcal{X}$ is compact. In actual implementations of the algorithm,
step 3 of MRAS$_0$ is often posed as an unconstrained optimization problem, i.e., $\Theta = \Re^m$,
in which case A4 is automatically satisfied. It is also easy to verify that A5 is satisfied by
most NEFs.

To show the convergence of MRAS$_0$, we will need the following key observation.


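Lemma 5.3.2 If assumptions A3-A5 hold, then we have

$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g_{k+1}}[\Gamma(X)], \quad \forall\, k = 0,1,\ldots,$$

where $E_{\theta_{k+1}}$ and $E_{g_{k+1}}$ denote the expectations taken with respect to $f(\cdot,\theta_{k+1})$ and
$g_{k+1}(\cdot)$, respectively.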

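Proof: Define $J_k(\theta,\bar\gamma_{k+1}) := \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}} \ln f(x,\theta)\,\nu(dx)$. Since $f(\cdot,\theta)$
belongs to the NEF, we can write

$$\begin{aligned}
J_k(\theta,\bar\gamma_{k+1}) ={}& \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}} \ln h(x)\,\nu(dx) \\
&+ \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\, \theta^T\Gamma(x)\,\nu(dx) \\
&- \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}} \ln\left[\int_{\mathcal{X}} \exp\left(\theta^T\Gamma(y)\right) h(y)\,\nu(dy)\right] \nu(dx).
\end{aligned}$$

Thus the gradient of $J_k(\theta,\bar\gamma_{k+1})$ with respect to $\theta$ can be expressed as

$$\begin{aligned}
\nabla_\theta J_k(\theta,\bar\gamma_{k+1}) ={}& \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\, \Gamma(x)\,\nu(dx) \\
&- \frac{\int_{\mathcal{X}} e^{\theta^T\Gamma(y)}\,\Gamma(y)\,h(y)\,\nu(dy)}{\int_{\mathcal{X}} e^{\theta^T\Gamma(y)}\,h(y)\,\nu(dy)} \int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\,\nu(dx),
\end{aligned}$$

where the validity of the interchange of derivative and integral above is guaranteed by
assumption A5 and the dominated convergence theorem; see, e.g., [69] for further details.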

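By A3 and the non-decreasing property of the sequence $\{\bar\gamma_k\}$, it turns out that the
gradient $\nabla_\theta J_k(\theta,\bar\gamma_{k+1})$ is finite and thus well-defined. Moreover, since $\rho > 0$, the set
$\{x : H(x) \ge \bar\gamma_{k+1}\} \cap \mathcal{X}$ will have a strictly positive Lebesgue/counting measure. It follows
that we must have $\int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\,\nu(dx) > 0$.

By setting $\nabla_\theta J_k(\theta,\bar\gamma_{k+1}) = 0$, it immediately follows that

$$\int_{\mathcal{X}} \frac{[S(H(x))]^k\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\,\Gamma(x)}{\int_{\mathcal{X}} [S(H(y))]^k\, I_{\{H(y)\ge\bar\gamma_{k+1}\}}\,\nu(dy)}\,\nu(dx) = \int_{\mathcal{X}} \frac{e^{\theta^T\Gamma(x)}\,h(x)\,\Gamma(x)}{\int_{\mathcal{X}} e^{\theta^T\Gamma(y)}\,h(y)\,\nu(dy)}\,\nu(dx),$$

and by the definitions of $g_{k+1}(\cdot)$ (cf. proof of Lemma 5.3.1) and $f(\cdot,\theta)$, we have

$$E_{g_{k+1}}[\Gamma(X)] = E_\theta[\Gamma(X)]. \tag{5.5}$$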

By assumption A4, since $\theta_{k+1}$ is the optimal solution of the problem

$$\arg\max_{\theta}\ J_k(\theta,\bar\gamma_{k+1}),$$

it must satisfy equation (5.5). Therefore we conclude that

$$E_{g_{k+1}}[\Gamma(X)] = E_{\theta_{k+1}}[\Gamma(X)], \quad \forall\, k = 0,1,\ldots.$$
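
The moment-matching property in Lemma 5.3.2 can be checked numerically in a very small case. The sketch below assumes a Bernoulli family on $\{0,1\}$ (so $\Gamma(x) = x$) and a hypothetical reference model `g`; it maximizes the weighted log-likelihood over a grid and compares the maximizer with $E_g[X]$. It is only an illustration, not part of the proof.

```python
import math

def weighted_loglik(p, g):
    """J(p) = sum_x g(x) * ln f(x, p) for a Bernoulli(p) family on {0, 1}."""
    return g[0] * math.log(1.0 - p) + g[1] * math.log(p)

g = {0: 0.2, 1: 0.8}                         # hypothetical reference model on {0, 1}
grid = [i / 1000.0 for i in range(1, 1000)]  # candidate success probabilities
best_p = max(grid, key=lambda p: weighted_loglik(p, g))
mean_under_g = sum(x * g[x] for x in g)      # E_g[Gamma(X)] with Gamma(x) = x
```

Here the grid maximizer coincides with the mean under `g`, in line with $E_{\theta_{k+1}}[\Gamma(X)] = E_{g_{k+1}}[\Gamma(X)]$.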


We have the following convergence result for the MRAS$_0$ algorithm.

Theorem 5.3.1 Let $\{\theta_k,\ k = 1,2,\ldots\}$ be the sequence of parameters generated by MRAS$_0$.
If $\varepsilon > 0$ and assumptions A1-A5 are satisfied, then

$$\lim_{k\to\infty} E_{\theta_k}[\Gamma(X)] = \Gamma(x^*), \tag{5.6}$$

where the limit is component-wise.

Remark 5.3.1 The convergence result in Theorem 5.3.1 is much stronger than it may
appear to be. For example, when $\Gamma(x)$ is a one-to-one function (which is the case for
many NEFs used in practice), the convergence result (5.6) can be equivalently written
as $\Gamma^{-1}\left(\lim_{k\to\infty} E_{\theta_k}[\Gamma(X)]\right) = x^*$. Also note that for some particular p.d.f.'s/p.m.f.'s, the
solution vector $x$ itself will be a component of $\Gamma(x)$ (e.g., multivariate normal distribution).
Under these circumstances, we can interpret (5.6) as $\lim_{k\to\infty} E_{\theta_k}[X] = x^*$. Another
special case of particular interest is when the components of the random vector $X = (X_1,\ldots,X_n)$
are independent, i.e., each has a univariate p.d.f./p.m.f. of the form

$$f(x_i,\vartheta_i) = \exp\left(x_i\vartheta_i - K(\vartheta_i)\right) h(x_i), \quad \vartheta_i\in\Re, \quad \forall\, i = 1,\ldots,n.$$

In this case, since the distribution of the random vector $X$ is simply the product of the
marginal distributions, we will clearly have $\Gamma(x) = x$. Thus, (5.6) is again equivalent
to $\lim_{k\to\infty} E_{\theta_k}[X] = x^*$, where $\theta_k := (\vartheta_1^k,\ldots,\vartheta_n^k)$, and $\vartheta_i^k$ is the value of $\vartheta_i$ at the $k$th
iteration.


In Lemma 5.3.2, we have already established a relationship between the reference models
$\{g_k(\cdot)\}$ and the sequence of sampling distributions $\{f(\cdot,\theta_k)\}$. Therefore, proving
Theorem 5.3.1 amounts to showing that $\lim_{k\to\infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*)$.


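Proof of Theorem 5.3.1: Recall from Lemma 5.3.1 that

$$g_{k+1}(x) := \frac{S(H(x))\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}\, g_k(x)}{E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]}, \quad \forall\, x\in\mathcal{X},\ k = 1,2,\ldots.$$

Thus

$$E_{g_{k+1}}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right] = \frac{E_{g_k}\left[[S(H(X))]^2\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]}{E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]} \ge E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]. \tag{5.7}$$

Since $\bar\gamma_k \le H(x^*)$ $\forall\, k$, and each strict increment in the sequence $\{\bar\gamma_k\}$ is lower
bounded by the quantity $\varepsilon > 0$, there exists a finite $\mathcal{N}$ such that $\bar\gamma_{k+1} = \bar\gamma_k$, $\forall\, k \ge \mathcal{N}$.
Before we proceed any further, we need to distinguish between two cases, $\bar\gamma_{\mathcal{N}} = H(x^*)$
and $\bar\gamma_{\mathcal{N}} < H(x^*)$.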


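Case 1. If $\bar\gamma_{\mathcal{N}} = H(x^*)$ (note that since $\rho > 0$, this could only happen when the solution
space is discrete), then from the definition of $g_{k+1}(\cdot)$ (see Lemma 5.3.1), we obviously have

$$g_{k+1}(x) = 0, \quad \forall\, x \ne x^*,$$

and

$$g_{k+1}(x^*) = \frac{[S(H(x^*))]^k\, I_{\{H(x^*) = H(x^*)\}}}{\int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) = H(x^*)\}}\,\nu(dx)} = 1, \quad \forall\, k \ge \mathcal{N}.$$

Hence it follows immediately that

$$E_{g_{k+1}}[\Gamma(X)] = \Gamma(x^*), \quad \forall\, k \ge \mathcal{N}.$$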


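Case 2. If $\bar\gamma_{\mathcal{N}} < H(x^*)$, then from (5.7), we have

$$E_{g_{k+1}}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+2}\}}\right] \ge E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right], \quad \forall\, k \ge \mathcal{N}-1, \tag{5.8}$$

i.e., the sequence $\left\{E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right],\ k = 1,2,\ldots\right\}$ converges.

Now we show that the limit of the above sequence is $S(H(x^*))$. To do so, we proceed
by contradiction and assume that

$$S_* := \lim_{k\to\infty} E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right] < S^* := S(H(x^*)). \tag{5.9}$$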


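Define the set $A$ as

$$A := \left\{x : H(x) \ge \bar\gamma_{\mathcal{N}}\right\} \cap \left\{x : S(H(x)) \ge \frac{S_* + S^*}{2}\right\} \cap \mathcal{X}.$$

Since $S(\cdot)$ is strictly increasing, its inverse $S^{-1}(\cdot)$ exists. Thus $A$ can be reformulated as

$$A = \left\{x : H(x) \ge \max\left\{\bar\gamma_{\mathcal{N}},\ S^{-1}\!\left(\frac{S_* + S^*}{2}\right)\right\}\right\} \cap \mathcal{X}.$$

And since $\bar\gamma_{\mathcal{N}} < H(x^*)$, $A$ has a strictly positive Lebesgue/discrete measure by A1.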
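Notice that $g_k(\cdot)$ can be rewritten as

$$g_k(x) = \prod_{i=1}^{k-1} \frac{S(H(x))\, I_{\{H(x)\ge\bar\gamma_{i+1}\}}}{E_{g_i}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{i+1}\}}\right]} \cdot g_1(x).$$

Since $\lim_{k\to\infty} \dfrac{S(H(x))\, I_{\{H(x)\ge\bar\gamma_{k+1}\}}}{E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]} = \dfrac{S(H(x))\, I_{\{H(x)\ge\bar\gamma_{\mathcal{N}}\}}}{S_*} > 1$, $\forall\, x\in A$, we conclude that

$$\lim_{k\to\infty} g_k(x) = \infty, \quad \forall\, x\in A.$$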


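Thus, by Fatou's lemma, we have

$$1 = \liminf_{k\to\infty} \int_{\mathcal{X}} g_k(x)\,\nu(dx) \ge \liminf_{k\to\infty} \int_A g_k(x)\,\nu(dx) \ge \int_A \liminf_{k\to\infty} g_k(x)\,\nu(dx) = \infty,$$

which is a contradiction. Hence, it follows that

$$\lim_{k\to\infty} E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right] = S^*. \tag{5.10}$$

In order to show that $\lim_{k\to\infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*)$, we now bound the difference
between $E_{g_k}[\Gamma(X)]$ and $\Gamma(x^*)$. Note that $\forall\, k \ge \mathcal{N}$, we have

$$\begin{aligned}
\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| &\le \int_{\mathcal{X}} \|\Gamma(x) - \Gamma(x^*)\|\, g_k(x)\,\nu(dx) \\
&= \int_{\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\|\, g_k(x)\,\nu(dx),
\end{aligned} \tag{5.11}$$

where $\mathcal{C} := \{x : H(x) \ge \bar\gamma_{\mathcal{N}}\} \cap \mathcal{X}$ is the support of $g_k(\cdot)$, $\forall\, k > \mathcal{N}$.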

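By the assumption on $\Gamma(\cdot)$ in Definition 5.3.1, for any given $\zeta > 0$, there exists a
$\delta > 0$ such that $\|x - x^*\| \le \delta$ implies $\|\Gamma(x) - \Gamma(x^*)\| \le \zeta$. With $A_\delta$ defined from assumption
A2, we have from (5.11),

$$\begin{aligned}
\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| \le{}& \int_{A_\delta^c\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\|\, g_k(x)\,\nu(dx) \\
&+ \int_{A_\delta\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\|\, g_k(x)\,\nu(dx) \\
\le{}& \zeta + \int_{A_\delta\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\|\, g_k(x)\,\nu(dx), \quad \forall\, k > \mathcal{N}.
\end{aligned} \tag{5.12}$$

The rest of the proof amounts to showing that the second term in (5.12) is also bounded.
Clearly the term $\|\Gamma(x) - \Gamma(x^*)\|$ is bounded on the set $A_\delta\cap\mathcal{C}$. We only need to find a
bound for $g_k(x)$.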

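By A2, we have

$$\sup_{x\in A_\delta\cap\mathcal{C}} H(x) \le \sup_{x\in A_\delta} H(x) < H(x^*).$$

Define $S_\delta := S^* - S\left(\sup_{x\in A_\delta} H(x)\right)$. Since $S(\cdot)$ is strictly increasing, we have $S_\delta > 0$.
Thus, it follows that

$$S(H(x)) \le S^* - S_\delta, \quad \forall\, x\in A_\delta\cap\mathcal{C}. \tag{5.13}$$

On the other hand, from (5.8) and (5.10), there exists $\bar{\mathcal{N}} \ge \mathcal{N}$ such that $\forall\, k \ge \bar{\mathcal{N}}$

$$E_{g_k}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right] \ge S^* - \frac{1}{2} S_\delta. \tag{5.14}$$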

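Observe that $g_k(x)$ can be alternatively expressed as

$$g_k(x) = \prod_{i=\bar{\mathcal{N}}}^{k-1} \frac{S(H(x))\, I_{\{H(x)\ge\bar\gamma_{i+1}\}}}{E_{g_i}\left[S(H(X))\, I_{\{H(X)\ge\bar\gamma_{i+1}\}}\right]} \cdot g_{\bar{\mathcal{N}}}(x), \quad \forall\, k \ge \bar{\mathcal{N}}.$$

Thus, it follows from (5.13) and (5.14) that

$$g_k(x) \le \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k-\bar{\mathcal{N}}} g_{\bar{\mathcal{N}}}(x), \quad \forall\, x\in A_\delta\cap\mathcal{C},\ \forall\, k \ge \bar{\mathcal{N}}.$$

Therefore,

$$\begin{aligned}
\left\|E_{g_k}[\Gamma(X)] - \Gamma(x^*)\right\| &\le \zeta + \sup_{x\in A_\delta\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\| \int_{A_\delta\cap\mathcal{C}} g_k(x)\,\nu(dx) \\
&\le \zeta + \sup_{x\in A_\delta\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\| \left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right)^{k-\bar{\mathcal{N}}}, \quad \forall\, k \ge \bar{\mathcal{N}} \\
&\le \left(1 + \sup_{x\in A_\delta\cap\mathcal{C}} \|\Gamma(x) - \Gamma(x^*)\|\right)\zeta, \quad \forall\, k \ge \hat{\mathcal{N}},
\end{aligned}$$

where $\hat{\mathcal{N}}$ is given by $\hat{\mathcal{N}} := \max\left\{\bar{\mathcal{N}},\ \left\lceil \bar{\mathcal{N}} + \ln\zeta \Big/ \ln\left(\frac{S^* - S_\delta}{S^* - S_\delta/2}\right) \right\rceil\right\}$.

And since $\zeta$ is arbitrary, we have

$$\lim_{k\to\infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*).$$

The proof is completed by applying Lemma 5.3.2 to both Case 1 and Case 2.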


Remark 5.3.2 Note that for problems with finite solution spaces, assumptions A1 and
A2 are automatically satisfied. Furthermore, if we take the input parameter $\varepsilon = 0$, then
step 2 of MRAS$_0$ is equivalent to $\bar\gamma_{k+1} = \max_{1\le i\le k+1} \gamma_i$. Thus, $\{\bar\gamma_k\}$ is non-decreasing
and each strict increment in the sequence is bounded from below by

$$\min_{x,y\in\mathcal{X},\ H(x)\ne H(y)} |H(x) - H(y)|.$$

Therefore, the $\varepsilon > 0$ assumption in Theorem 5.3.1 can be relaxed to $\varepsilon \ge 0$.

We now address some of the special cases discussed in Remark 5.3.1.


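Corollary 5.3.2 (Multivariate Normal) For continuous optimization problems in $\Re^n$,
if multivariate normal p.d.f.'s are used in MRAS$_0$, i.e.,

$$f(x,\theta_k) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\left(-\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\right), \tag{5.15}$$

where $\theta_k := (\mu_k; \Sigma_k)$, $\varepsilon > 0$, and assumptions A1-A4 are satisfied, then

$$\lim_{k\to\infty} \mu_k = x^*, \quad \text{and} \quad \lim_{k\to\infty} \Sigma_k = 0_{n\times n},$$

where $0_{n\times n}$ represents an $n$-by-$n$ zero matrix.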

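Proof: By Lemma 5.3.2, it is easy to show that

$$\mu_{k+1} = E_{g_{k+1}}(X), \quad \forall\, k = 0,1,\ldots,$$

and

$$\Sigma_{k+1} = E_{g_{k+1}}\left[(X - \mu_{k+1})(X - \mu_{k+1})^T\right], \quad \forall\, k = 0,1,\ldots.$$

The rest of the proof amounts to showing that

$$\lim_{k\to\infty} E_{g_k}(X) = x^*, \quad \text{and} \quad \lim_{k\to\infty} E_{g_k}\left[(X - \mu_k)(X - \mu_k)^T\right] = 0_{n\times n},$$

which is the same as the proof of Theorem 5.3.1. ∎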


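Remark 5.3.3 Corollary 5.3.2 shows that in the multivariate normal case, the sequence
of parameterized p.d.f.'s will converge to a degenerate p.d.f. concentrated only on the
optimal solution. In this case the parameters are updated as

$$\mu_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X,\theta_k)\right\} I_{\{H(X)\ge\bar\gamma_{k+1}\}}\, X\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X,\theta_k)\right\} I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]},$$

and

$$\Sigma_{k+1} = \frac{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X,\theta_k)\right\} I_{\{H(X)\ge\bar\gamma_{k+1}\}}\, (X - \mu_{k+1})(X - \mu_{k+1})^T\right]}{E_{\theta_k}\left[\left\{[S(H(X))]^k / f(X,\theta_k)\right\} I_{\{H(X)\ge\bar\gamma_{k+1}\}}\right]},$$

where $f(x,\theta_k)$ is given by (5.15). Note that when the solution space $\mathcal{X}$ is a (simple)
constrained region in $\Re^n$, one straightforward approach is to use the acceptance-rejection
method (cf., e.g., [51]). It is easy to verify that the parameter updating rules remain
the same.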
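
When the expectations in Remark 5.3.3 are replaced by sample averages, the updates reduce to a weighted mean and a weighted variance. The sketch below shows a one-dimensional analogue under several assumptions that are not taken from the text: a univariate normal family, $S(\cdot) = \exp(\cdot)$, a hypothetical objective `H`, and a fixed list of points standing in for samples drawn from $f(\cdot,\theta_k)$.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def mras_normal_update(samples, H, k, threshold, mu_k, sigma_k, S=math.exp):
    """One-dimensional sample analogue of the mean/variance updates.
    Each sample gets weight S(H(x))**k / f(x, theta_k) if H(x) >= threshold, else 0."""
    weights = [
        (S(H(x)) ** k / normal_pdf(x, mu_k, sigma_k)) if H(x) >= threshold else 0.0
        for x in samples
    ]
    total = sum(weights)
    if total == 0.0:
        return mu_k, sigma_k ** 2  # no sample clears the threshold: keep parameters
    mu_next = sum(w * x for w, x in zip(weights, samples)) / total
    var_next = sum(w * (x - mu_next) ** 2 for w, x in zip(weights, samples)) / total
    return mu_next, var_next

H = lambda x: -(x - 2.0) ** 2  # hypothetical objective with maximizer x* = 2
mu_next, var_next = mras_normal_update(
    [0.0, 1.0, 2.0, 3.0], H, k=1, threshold=-1.5, mu_k=2.0, sigma_k=1.0
)
```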

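Corollary 5.3.3 (Independent Univariate) If the components of the random vector
$X = (X_1,\ldots,X_n)$ are independent, each has a univariate p.d.f./p.m.f. of the form

$$f(x_i,\vartheta_i) = \exp\left(x_i\vartheta_i - K(\vartheta_i)\right) h(x_i), \quad \vartheta_i\in\Re, \quad \forall\, i = 1,\ldots,n,$$

$\varepsilon > 0$, and A1-A5 are satisfied, then

$$\lim_{k\to\infty} E_{\theta_k}[X] = x^*, \quad \text{where } \theta_k := (\vartheta_1^k,\ldots,\vartheta_n^k).$$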

5.4  An  Alternative  View  of  the  Cross-Entropy  Method 

In this section, we give an alternative interpretation of the CE method for optimization
and discuss its similarities to and differences from the MRAS$_0$ algorithm. Specifically, we
show that the CE method can also be viewed as a search strategy guided by a sequence
of reference models. From this particular point of view, we establish some important
properties of the CE method.

The deterministic version of the CE method for solving (5.1) can be summarized as
follows.

Algorithm CE$_0$: Deterministic Version of the CE Method

1. Choose the initial p.d.f./p.m.f. $f(\cdot,\theta_0)$, $\theta_0\in\Theta$. Specify the parameter $\rho\in(0,1]$ and
a non-decreasing function $\varphi(\cdot): \Re \to \Re^+\cup\{0\}$. Set $k = 0$.

2. Calculate the $(1-\rho)$-quantile $\gamma_{k+1}$ as

$$\gamma_{k+1} := \sup_l\left\{l : P_{\theta_k}(H(X) \ge l) \ge \rho\right\}.$$

3. Compute the new parameter

$$\theta_{k+1} := \arg\max_{\theta\in\Theta} E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}} \ln f(X,\theta)\right].$$

4. If a specified stopping rule is satisfied, then terminate; otherwise set $k = k + 1$ and
go to Step 2.

In CE$_0$, choosing $\varphi(H(x)) = 1$ gives the standard CE method, whereas choosing $\varphi(H(x)) = H(x)$
(if $H(x) \ge 0$, $\forall\, x\in\mathcal{X}$) gives an extended version of the standard CE method (cf.,
e.g., [26]).

One resemblance between CE and MRAS$_0$ is the use of the parameter $\rho$ and the
$(1-\rho)$-quantile in both algorithms. However, the fundamental difference is that in CE, the
problem of estimating the optimal value of the parameter is broken down into a sequence
of simple estimation problems, in which the parameter $\rho$ assumes a crucial role. Since a
small change in the value of $\rho$ may disturb the whole estimation process and affect the
quality of the resulting estimates, the convergence of CE cannot always be guaranteed
unless the value of $\rho$ is chosen sufficiently small (cf. [26], [41]; also Example 5.4.1 below),
whereas the theoretical convergence of MRAS$_0$ is unaffected by the parameter $\rho$.

The following lemma provides a unified view of MRAS and CE; it shows that by
appropriately defining a sequence of implicit reference models $\{g_k^{ce}(\cdot),\ k = 1,2,\ldots\}$, the
CE method can be recovered, and the parameter updating in CE is guided by this sequence
of models.

Lemma 5.4.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of the CE$_0$ algorithm
minimizes the KL-divergence $\mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta)\right)$, where

$$g_{k+1}^{ce}(x) := \frac{\varphi(H(x))\, I_{\{H(x)\ge\gamma_{k+1}\}}\, f(x,\theta_k)}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}, \quad \forall\, x\in\mathcal{X},\ k = 0,1,\ldots. \tag{5.18}$$

Proof: Similar to the proof of Lemma 5.3.1.


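The key observation to note is that in contrast to MRAS$_0$, the sequence of reference models
in CE depends explicitly on the family of parameterized p.d.f.'s/p.m.f.'s $\{f(\cdot,\theta_k)\}$ used.
Since $g_{k+1}^{ce}(\cdot)$ is obtained by tilting $f(\cdot,\theta_k)$ with the performance function, it improves the
expected performance in the sense that

$$E_{g_{k+1}^{ce}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right] = \frac{E_{\theta_k}\left[[\varphi(H(X))]^2\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]} \ge E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right].$$

Thus, it is reasonable to expect that the projection of $g_{k+1}^{ce}(\cdot)$ on $\{f(\cdot,\theta) : \theta\in\Theta\}$ (i.e.,
$f(\cdot,\theta_{k+1})$) also improves the expected performance. This result is formalized in the
following theorem.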


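Theorem 5.4.1 For the CE$_0$ algorithm, we have

$$E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right] \ge E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right], \quad \forall\, k = 0,1,\ldots.$$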


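Proof: Define

$$\hat g_{k+1}(x) := \frac{\varphi(H(x))\, I_{\{H(x)\ge\gamma_{k+1}\}}\, f(x,\theta_{k+1})}{E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}.$$

We have from the definition of $g_{k+1}^{ce}(\cdot)$,

$$E_{g_{k+1}^{ce}}\left[\ln\frac{g_{k+1}^{ce}(X)}{\hat g_{k+1}(X)}\right] = E_{g_{k+1}^{ce}}\left[\ln\frac{f(X,\theta_k)}{f(X,\theta_{k+1})}\right] + \ln\frac{E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}.$$

Since $\theta_{k+1}$ minimizes the KL-divergence $\mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta)\right)$ (cf. Lemma 5.4.1), it follows that

$$\begin{aligned}
0 &\le \mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta_k)\right) - \mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta_{k+1})\right) \\
&\le \mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta_k)\right) - \mathcal{D}\left(g_{k+1}^{ce}, f(\cdot,\theta_{k+1})\right) + \mathcal{D}\left(g_{k+1}^{ce}, \hat g_{k+1}\right) \\
&= E_{g_{k+1}^{ce}}\left[\ln\frac{f(X,\theta_{k+1})}{f(X,\theta_k)}\right] + E_{g_{k+1}^{ce}}\left[\ln\frac{f(X,\theta_k)}{f(X,\theta_{k+1})}\right] + \ln\frac{E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]} \\
&= \ln\frac{E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}{E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right]}.
\end{aligned}$$

Therefore

$$E_{\theta_{k+1}}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right] \ge E_{\theta_k}\left[\varphi(H(X))\, I_{\{H(X)\ge\gamma_{k+1}\}}\right].$$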


In the standard CE method, Theorem 5.4.1 implies the monotonicity of the sequence
$\{\gamma_k,\ k = 1,2,\ldots\}$.

Lemma 5.4.2 For the standard CE method (i.e., CE$_0$ with $\varphi(H(x)) = 1$), we have

$$\gamma_{k+2} \ge \gamma_{k+1}, \quad \forall\, k = 0,1,\ldots.$$

Proof: By Theorem 5.4.1, we have

$$E_{\theta_{k+1}}\left[I_{\{H(X)\ge\gamma_{k+1}\}}\right] \ge E_{\theta_k}\left[I_{\{H(X)\ge\gamma_{k+1}\}}\right],$$

i.e.,

$$P_{\theta_{k+1}}(H(X) \ge \gamma_{k+1}) \ge P_{\theta_k}(H(X) \ge \gamma_{k+1}) \ge \rho.$$

The result follows by the definition of $\gamma_{k+2}$ (see Step 2 of the CE$_0$ algorithm). ∎

Note that since $\gamma_k \le H(x^*)$ for all $k$, Lemma 5.4.2 implies that the sequence $\{\gamma_k,\ k = 1,2,\ldots\}$
generated by the standard CE method converges. However, depending on the p.d.f.'s/p.m.f.'s
and the parameter $\rho$ used, the sequence $\{\gamma_k\}$ may not converge to $H(x^*)$ or even to a small
neighborhood of $H(x^*)$ (cf. Examples 5.4.1 and 5.4.2 below).

Similar to MRAS$_0$ (cf. Lemma 5.3.2), when $f(\cdot,\theta)$ belongs to the natural exponential
families, the following lemma relates the sequence $\{f(\cdot,\theta_k),\ k = 1,2,\ldots\}$ to the sequence
of reference models $\{g_k^{ce}(\cdot),\ k = 1,2,\ldots\}$.

Lemma 5.4.3 Assume that:

1. There exists a compact set $\Pi$ such that the level set $\{x : H(x) \ge \gamma_k\} \cap \mathcal{X} \subseteq \Pi$ for
all $k = 1,2,\ldots$, where $\gamma_k = \sup_l\{l : P_{\theta_{k-1}}(H(X) \ge l) \ge \rho\}$ is defined as in the CE$_0$
algorithm.

2. The parameter $\theta_{k+1}$ computed at step 3 of the CE$_0$ algorithm is an interior point of
$\Theta$ for all $k$.

3. Assumption A5 is satisfied.

Then

$$E_{\theta_{k+1}}[\Gamma(X)] = E_{g_{k+1}^{ce}}[\Gamma(X)], \quad \forall\, k = 0,1,\ldots.$$

The above lemma indicates that the behavior of the sequence of p.d.f.'s/p.m.f.'s $\{f(\cdot,\theta_k)\}$
is closely related to the properties of the sequence of reference models. To understand
this, consider the particular case where $\Gamma(x) = x$. If the CE method converges to
the optimal solution in the sense that $\lim_{k\to\infty} E_{\theta_k}[H(X)] = H(x^*)$, then we must have
$\lim_{k\to\infty} E_{\theta_k}[X] = x^*$, since $H(x) < H(x^*)$ $\forall\, x \ne x^*$. Thus, by Lemma 5.4.3, a necessary
condition for this convergence is $\lim_{k\to\infty} E_{g_k^{ce}}[X] = x^*$. However, unlike MRAS$_0$, where
the convergence of the sequence of reference models to an optimal degenerate distribution
is guaranteed, the convergence of the sequence $\{g_k^{ce}(\cdot),\ k = 1,2,\ldots\}$ relies on the choices
of the families of distributions $\{f(\cdot,\theta)\}$ and the values of the parameter $\rho$ used (cf. (5.18)).
We now illustrate this issue by two simple examples.

Example 5.4.1 (The Standard CE Method) Consider maximizing the function $H(x)$
given by

$$H(x) = \begin{cases} 0 & x\in\{(0,1),(1,0)\}, \\ 1 & x = (0,0), \\ a & x = (1,1), \end{cases} \tag{5.19}$$

where $a > 1$, and $x := (x_1,x_2) \in \mathcal{X} := \{(0,0),(0,1),(1,0),(1,1)\}$.
If we take $0.25 < \rho \le 0.5$ and an initial p.m.f.

$$f(x,\theta_0) = p_0^{x_1}(1-p_0)^{1-x_1}\, q_0^{x_2}(1-q_0)^{1-x_2} \quad \text{with } \theta_0 = (p_0,q_0) = (0.5,0.5),$$

then since $P_{\theta_0}(x\in\{(0,0),(1,1)\}) = 0.5 \ge \rho$, we have $\gamma_1 = 1$. It is also straightforward to
see that

$$g_1^{ce}(x) = \begin{cases} 0.5 & x = (0,0) \text{ or } (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

and the parameter $\theta_1$ computed at step 3 (with $\varphi(H(x)) = 1$) of CE$_0$ is given by $\theta_1 = (0.5,0.5)$.
Proceeding iteratively, we have $\gamma_k = 1$ and $g_k^{ce}(x) = g_1^{ce}(x)$ $\forall\, k = 1,2,\ldots$, i.e.,
the algorithm does not converge to a degenerate distribution at the optimal solution.

On the other hand, if we choose $\rho \le 0.25$, then it turns out that $\gamma_k = a$, and

$$g_k^{ce}(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}, \end{cases}$$

for all $k = 1,2,\ldots$, which means the algorithm converges to the optimum.


Example 5.4.2 (The Extended Version of the CE Method) Consider solving problem
(5.19) by CE$_0$ with the performance function $\varphi(H(x)) = H(x)$. We use the same
family of p.m.f.'s as in Example 5.4.1 with the initial parameter $\theta_0 = \left(\frac{1}{1+a}, \frac{1}{1+a}\right)$. If the
values of $\rho$ are chosen from the interval $\left(\frac{1}{(1+a)^2}, \frac{a^2+1}{(1+a)^2}\right]$, then we have $\theta_k = \left(\frac{1}{1+a}, \frac{1}{1+a}\right)$,
$\gamma_k = 1$, and

$$g_k^{ce}(x) = \begin{cases} \dfrac{a}{1+a} & x = (0,0), \\[1ex] \dfrac{1}{1+a} & x = (1,1), \\[1ex] 0 & \text{otherwise}, \end{cases}$$

for all $k = 1,2,\ldots$.

On the other hand, if we choose $\rho = 0.5$ and $\theta_0 = (0.5,0.5)$, then it is easy to verify
that $\lim_{k\to\infty} \gamma_k = a$ and

$$\lim_{k\to\infty} g_k^{ce}(x) = \begin{cases} 1 & x = (1,1), \\ 0 & \text{otherwise}. \end{cases}$$
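
Both examples are small enough to be reproduced numerically. The following Python sketch runs a deterministic CE$_0$ iteration on $\{0,1\}^2$ with independent Bernoulli components; it assumes $a = 2$ and carries out step 3 by moment matching, $\theta_{k+1} = E_{g_{k+1}^{ce}}[X]$, as suggested by Lemma 5.4.3 with $\Gamma(x) = x$. The names `ce0` and `make_H` are hypothetical, and the small tolerance in the quantile test is an implementation convenience.

```python
from itertools import product

POINTS = list(product((0, 1), repeat=2))

def make_H(a):
    """Objective (5.19): H = 1 at (0,0), a at (1,1), and 0 elsewhere."""
    return lambda x: {(0, 0): 1.0, (1, 1): float(a)}.get(x, 0.0)

def ce0(H, theta, rho, phi, iterations):
    """Deterministic CE_0 on {0,1}^2 with independent Bernoulli components.
    Step 3 is done by moment matching: theta_{k+1} = E_{g^ce_{k+1}}[X]."""
    gamma = None
    for _ in range(iterations):
        p, q = theta
        f = {x: (p if x[0] else 1.0 - p) * (q if x[1] else 1.0 - q) for x in POINTS}
        # Step 2: largest level l with P_theta(H(X) >= l) >= rho.
        levels = sorted({H(x) for x in POINTS})
        gamma = max(
            l for l in levels
            if sum(f[x] for x in POINTS if H(x) >= l) >= rho - 1e-12
        )
        # Reference model g^ce_{k+1}: tilt f(., theta_k) by phi(H) above the threshold.
        w = {x: phi(H(x)) * f[x] if H(x) >= gamma else 0.0 for x in POINTS}
        total = sum(w.values())
        g = {x: w[x] / total for x in POINTS}
        theta = (sum(g[x] * x[0] for x in POINTS), sum(g[x] * x[1] for x in POINTS))
    return theta, gamma

a = 2.0
H = make_H(a)
standard = lambda h: 1.0  # phi(H(x)) = 1, standard CE
extended = lambda h: h    # phi(H(x)) = H(x), extended CE

stuck, gamma_stuck = ce0(H, (0.5, 0.5), rho=0.4, phi=standard, iterations=10)
solved, gamma_solved = ce0(H, (0.5, 0.5), rho=0.2, phi=standard, iterations=10)
fixed, gamma_fixed = ce0(H, (1 / (1 + a), 1 / (1 + a)), rho=0.5, phi=extended, iterations=3)
ext_solved, gamma_ext = ce0(H, (0.5, 0.5), rho=0.5, phi=extended, iterations=10)
```

Only a few iterations are run from $\theta_0 = (\frac{1}{1+a}, \frac{1}{1+a})$, because in floating-point arithmetic the iterate may slowly drift away from that point over many iterations.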

5.5 The MRAS$_1$ Algorithm (Monte Carlo Version)


The MRAS$_0$ algorithm describes the idealized situation where quantile values and
expectations can be evaluated exactly. In practice, we will usually resort to its stochastic
counterpart, where only a finite number of samples are used and expected values are
replaced with their corresponding sample averages. For example, step 3 of MRAS$_0$ will
be replaced with

$$\tilde\theta_{k+1} = \arg\max_{\theta\in\Theta} \frac{1}{N} \sum_{i=1}^{N} \frac{[S(H(X_i))]^k}{f(X_i,\tilde\theta_k)}\, I_{\{H(X_i)\ge\bar\gamma_{k+1}\}} \ln f(X_i,\theta), \tag{5.20}$$

where $X_1,\ldots,X_N$ are i.i.d. random samples generated from $f(x,\tilde\theta_k)$, $\tilde\theta_k$ is the estimated
parameter vector computed at the previous iteration, and $\bar\gamma_{k+1}$ is a threshold determined
by the sample $(1-\rho)$-quantile of $H(X_1),\ldots,H(X_N)$.

However, the theoretical convergence can no longer be guaranteed for a simple stochastic
counterpart of MRAS$_0$. In particular, the set $\{x : H(x) \ge \bar\gamma_{k+1}\}$ involved in (5.20)
may be empty, since all the random samples generated at the current iteration may be
much worse than those generated at the previous iteration. Thus, we can only expect the
algorithm to converge if the expected values in the MRAS$_0$ algorithm are closely approximated.
The quality of the approximation will depend on the number of samples
used in the simulation, but it is difficult to determine the appropriate number of samples
in advance. A sample size that is too small will cause the algorithm to fail to converge
and result in poor-quality solutions, whereas a sample size that is too large may lead to high
computational cost.

As mentioned earlier, the parameter $\rho$ will, to some extent, affect the performance
of the algorithm. Large values of $\rho$ mean that almost all samples generated, regardless of
their performances, will be used to update the probabilistic model, which could slow down
the convergence process. On the other hand, since a good estimate will necessarily require
a reasonable number of valid samples, the quantity $\rho N$ (i.e., the approximate number of
samples that will be used in parameter updating) cannot be too small. Thus, small values
of $\rho$ will require a large number of samples to be generated at each iteration and may
result in significant simulation effort. For a given problem, although it is clear that we
should avoid values of $\rho$ that are either too close to 1 or too close to 0, determining
a priori which $\rho$ gives satisfactory performance may be difficult.

In order to address the above difficulties, we adopt the same idea as in [41] and
propose a modified Monte Carlo version of MRAS$_0$ in which the sample size $N$ is adaptively
increasing and the parameter $\rho$ is adaptively decreasing.

5.5.1  Algorithm  Description 

Roughly speaking, the MRAS$_1$ algorithm is essentially a Monte Carlo version of
MRAS$_0$ except that the parameter $\rho$ and the sample size $N$ may change from one iteration
to another. The rate of increase in the sample size is controlled by an extra parameter
$\alpha > 1$, specified during the initialization step. For example, if the initial sample size is
$N_0$, then after $k$ increments, the sample size will be approximately $\lceil\alpha^k N_0\rceil$.

Algorithm MRAS$_1$: Monte Carlo Version

• Initialization: Specify $\rho_0\in(0,1]$, an initial sample size $N_0 > 1$, $\varepsilon \ge 0$, $\alpha > 1$, a mixing
coefficient $\lambda\in(0,1]$, a strictly increasing function $S(\cdot): \Re \to \Re^+$, and an initial p.d.f.
$f(x,\theta_0) > 0$ $\forall\, x\in\mathcal{X}$. Set $\tilde\theta_0 \leftarrow \theta_0$, $k \leftarrow 0$.

• Repeat until a specified stopping rule is satisfied:

1. Generate $N_k$ i.i.d. samples $X_1^k,\ldots,X_{N_k}^k$ according to
$\tilde f(\cdot,\tilde\theta_k) := (1-\lambda) f(\cdot,\tilde\theta_k) + \lambda f(\cdot,\theta_0)$.

2. Compute the sample $(1-\rho_k)$-quantile $\tilde\gamma_{k+1}(\rho_k,N_k) := H_{(\lceil(1-\rho_k)N_k\rceil)}$, where $\lceil a\rceil$ is
the smallest integer greater than $a$, and $H_{(i)}$ is the $i$th order statistic of the sequence
$\{H(X_i^k),\ i = 1,\ldots,N_k\}$.

3. If $k = 0$ or $\tilde\gamma_{k+1}(\rho_k,N_k) \ge \bar\gamma_k + \frac{\varepsilon}{2}$, then
   3a. Set $\bar\gamma_{k+1} \leftarrow \tilde\gamma_{k+1}(\rho_k,N_k)$, $\rho_{k+1} \leftarrow \rho_k$, $N_{k+1} \leftarrow N_k$.
   else, find the largest $\bar\rho\in(0,\rho_k)$ such that $\tilde\gamma_{k+1}(\bar\rho,N_k) \ge \bar\gamma_k + \frac{\varepsilon}{2}$.
   3b. If such a $\bar\rho$ exists, then set $\bar\gamma_{k+1} \leftarrow \tilde\gamma_{k+1}(\bar\rho,N_k)$, $\rho_{k+1} \leftarrow \bar\rho$, $N_{k+1} \leftarrow N_k$.
   3c. else (if no such $\bar\rho$ exists), set $\bar\gamma_{k+1} \leftarrow \bar\gamma_k$, $\rho_{k+1} \leftarrow \rho_k$, $N_{k+1} \leftarrow \lceil\alpha N_k\rceil$.
   endif

4. Compute $\tilde\theta_{k+1}$ as

$$\tilde\theta_{k+1} = \arg\max_{\theta\in\Theta} \frac{1}{N_k} \sum_{i=1}^{N_k} \frac{[S(H(X_i^k))]^k}{\tilde f(X_i^k,\tilde\theta_k)}\, I_{\{H(X_i^k)\ge\bar\gamma_{k+1}\}} \ln f(X_i^k,\theta). \tag{5.21}$$

5. Set $k \leftarrow k + 1$.

At each iteration $k$, random samples are drawn from the density/mass function
$\tilde f(\cdot,\tilde\theta_k)$, which is a mixture of the initial density/mass $f(\cdot,\theta_0)$ and the density/mass
calculated from the previous iteration $f(\cdot,\tilde\theta_k)$ (cf., e.g., [9] for a similar idea in the context
of multiarmed bandit models). We assume that $f(\cdot,\theta_0)$ satisfies the following condition:

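Assumption A3'. There exists a compact set $\Pi_\varepsilon$ such that $\{x : H(x) \ge H(x^*) - \varepsilon\} \cap \mathcal{X} \subseteq \Pi_\varepsilon$.
Moreover, the initial density/mass function $f(x,\theta_0)$ is bounded away from zero on
$\Pi_\varepsilon$, i.e., $f_* := \inf_{x\in\Pi_\varepsilon} f(x,\theta_0) > 0$.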

In practice, the initial density $f(\cdot,\theta_0)$ can be chosen according to some prior knowledge of
the problem structure; however, if nothing is known about where the good solutions are,
this density should be chosen in such a way that each region in the solution space will have
an (approximately) equal probability of being sampled. For instance, when $\mathcal{X}$ is finite,
one simple choice of $f(\cdot,\theta_0)$ is the uniform distribution. Intuitively, mixing in the initial
density forces the algorithm to explore the entire solution space and to maintain a global
perspective during the search process. Also note that if $\lambda = 1$, then random samples will
always be drawn from the initial density, in which case MRAS$_1$ becomes a pure random
sampling approach.
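
Sampling from the mixture $(1-\lambda) f(\cdot,\tilde\theta_k) + \lambda f(\cdot,\theta_0)$ can be done by first choosing a component at random and then sampling from it. The Python sketch below illustrates this; the function name `sample_mixture` and the two example distributions are hypothetical stand-ins, not choices made in the text.

```python
import random

def sample_mixture(sample_current, sample_initial, lam, n, rng):
    """Draw n samples from (1 - lam) * f(., theta_k) + lam * f(., theta_0):
    with probability lam use the initial model, otherwise the current one."""
    return [
        sample_initial(rng) if rng.random() < lam else sample_current(rng)
        for _ in range(n)
    ]

rng = random.Random(0)
current = lambda r: r.gauss(5.0, 0.1)       # hypothetical current model f(., theta_k)
initial = lambda r: r.uniform(-10.0, 10.0)  # hypothetical initial model f(., theta_0)
only_initial = sample_mixture(current, initial, lam=1.0, n=100, rng=rng)
mixed = sample_mixture(current, initial, lam=0.1, n=100, rng=rng)
```

With `lam=1.0` every draw comes from the initial model, which corresponds to the pure random sampling case noted above.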

At step 2, the sample $(1-\rho_k)$-quantile $\tilde\gamma_{k+1}$ is calculated by first ordering the sample
performances $H(X_i^k)$, $i = 1,\ldots,N_k$, from smallest to largest, $H_{(1)} \le H_{(2)} \le \cdots \le H_{(N_k)}$,
and then taking the $\lceil(1-\rho_k)N_k\rceil$th order statistic. We use the function $\tilde\gamma_{k+1}(\rho_k,N_k)$ to
emphasize the dependencies of $\tilde\gamma_{k+1}$ on both $\rho_k$ and $N_k$, so that different sample quantile
values used during one iteration can be distinguished by their arguments.
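
The order-statistic computation in step 2 can be sketched as follows. This is a minimal illustration that assumes the usual ceiling function (`math.ceil`) for $\lceil\cdot\rceil$ and clamps the index to the valid range; the function name `sample_quantile` is hypothetical.

```python
import math

def sample_quantile(performances, rho):
    """Sample (1 - rho)-quantile: the ceil((1 - rho) * N)-th order statistic
    (1-based) of the sample performances."""
    ordered = sorted(performances)
    n = len(ordered)
    index = min(max(math.ceil((1.0 - rho) * n), 1), n)  # clamp to 1..N
    return ordered[index - 1]

q_half = sample_quantile([3.0, 1.0, 4.0, 2.0], rho=0.5)      # 2nd order statistic
q_quarter = sample_quantile([3.0, 1.0, 4.0, 2.0], rho=0.25)  # 3rd order statistic
```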

Step 3 of MRAS$_1$ is used to extract a sequence of non-decreasing thresholds $\{\bar\gamma_k,\ k = 1, 2, \ldots\}$
from the sequence of sample quantiles $\{\tilde\gamma_k\}$, and to determine the appropriate
values of $\rho_{k+1}$ and $N_{k+1}$ to be used in subsequent iterations. This step is carried out as
follows. At each iteration $k$, we first check whether the inequality
$\tilde\gamma_{k+1}(\rho_k, N_k) \ge \bar\gamma_k + \varepsilon/2$
is satisfied, where $\bar\gamma_k$ is the threshold value used in the previous iteration. If the inequality
holds, then both the current $\rho_k$ value and the current sample size $N_k$ are
satisfactory; thus we proceed to step 3a and update the parameter vector $\tilde\theta_{k+1}$ in step 4
by using $\tilde\gamma_{k+1}(\rho_k, N_k)$. Otherwise, either $\rho_k$ is too large or the sample
size $N_k$ is too small. To determine which, we fix the sample size $N_k$ and check whether there
exists a smaller $\bar\rho < \rho_k$ such that the above inequality is satisfied with the new sample
$(1-\bar\rho)$-quantile. If such a $\bar\rho$ exists, then the current sample size $N_k$ is still deemed
acceptable, and we only need to decrease the $\rho_k$ value. Accordingly, the parameter vector
is updated in step 4 by using the sample $(1-\bar\rho)$-quantile. On the other hand, if no such $\bar\rho$
can be found, then the parameter vector is updated by using the threshold $\bar\gamma_k$ calculated
during the previous iteration, and the sample size $N_k$ is increased by a factor $\alpha$.
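
The three branches of step 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, and in branch 3b the returned value `rho_bar` is a representative point (the midpoint) of the interval of quantile parameters that select the chosen order statistic, since the sample quantile only changes at order statistics.

```python
import bisect
import math


def sample_quantile(performances, rho):
    """ceil((1 - rho) * N)-th order statistic of the sample."""
    ordered = sorted(performances)
    return ordered[math.ceil((1.0 - rho) * len(ordered)) - 1]


def update_threshold(performances, gamma_bar, rho, eps, alpha):
    """Return (next threshold, next rho, next sample size, branch label)."""
    n = len(performances)
    target = gamma_bar + eps / 2.0
    quantile = sample_quantile(performances, rho)
    if quantile >= target:
        # Step 3a: the current rho and sample size are satisfactory.
        return quantile, rho, n, "3a"
    ordered = sorted(performances)
    idx = bisect.bisect_left(ordered, target)   # first order statistic >= target
    if idx < n:
        # Step 3b: a smaller rho exists; keep the sample size.
        rank = idx + 1
        rho_bar = (n - rank + 0.5) / n          # illustrative representative value
        return ordered[idx], rho_bar, n, "3b"
    # Step 3c: no sample clears the target; keep the threshold, enlarge the sample.
    return gamma_bar, rho, math.ceil(alpha * n), "3c"
```

Because the first order statistic at or above the target is chosen in branch 3b, the selected quantile parameter is as large as possible among those that satisfy the inequality.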

We make the following assumption about the parameter vector $\tilde\theta_{k+1}$ computed at
step 4:

Assumption A4'. The parameter vector $\tilde\theta_{k+1}$ computed at step 4 of MRAS$_1$ is an interior
point of $\Theta$ for all $k$.

It is important to note that the set $\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\}$ could be empty
if step 3c is visited. If this happens, the right-hand side of (5.21) will be equal to zero, so
any $\theta \in \Theta$ is a maximizer, and we define $\tilde\theta_{k+1} := \tilde\theta_k$ in this case.

5.5.2  Global  Convergence 

In this Chapter, we discuss the convergence properties of the MRAS$_1$ algorithm for
natural exponential families (NEFs). To be specific, we will explore the relations between
MRAS$_1$ and MRAS$_0$ and show that with high probability, the gaps (e.g., approximation
errors incurred by replacing expected values with sample averages) between the two
algorithms can be made small enough that the convergence analysis of MRAS$_1$ can
be ascribed to the convergence analysis of the MRAS$_0$ algorithm; thus, our analysis relies
heavily on the results obtained in Chapter 5.3.2. Throughout this Chapter, we denote
by $P_{\tilde\theta_k}(\cdot)$ and $E_{\tilde\theta_k}[\cdot]$ the respective probability and expectation taken with respect to the
p.d.f./p.m.f. $f(\cdot, \tilde\theta_k)$, and by $\tilde P_{\tilde\theta_k}(\cdot)$ and $\tilde E_{\tilde\theta_k}[\cdot]$ the respective probability and expectation
taken with respect to $\tilde f(\cdot, \tilde\theta_k)$. Note that since the sequence $\{\tilde\theta_k\}$ results from random
samples generated at each iteration of MRAS$_1$, these quantities are also random.

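Let $\tilde g_{k+1}(\cdot)$, $k = 0, 1, \ldots$, be defined by

\[
\tilde g_{k+1}(x) :=
\begin{cases}
\dfrac{\frac{[S(H(x))]^k}{\tilde f(x, \tilde\theta_k)}\, I_{\{H(x) \ge \bar\gamma_{k+1}\}}}
      {\sum_{x \in \Lambda_k} \frac{[S(H(x))]^k}{\tilde f(x, \tilde\theta_k)}\, I_{\{H(x) \ge \bar\gamma_{k+1}\}}}
  & \text{if } \{x : H(x) \ge \bar\gamma_{k+1},\ x \in \Lambda_k\} \ne \emptyset, \\[2ex]
\tilde g_k(x) & \text{otherwise,}
\end{cases}
\tag{5.22}
\]

where $\Lambda_k := \{X_1^k, \ldots, X_{N_k}^k\}$ is the population of candidate solutions generated at iteration
$k$, and $\bar\gamma_{k+1}$ is given by

\[
\bar\gamma_{k+1} :=
\begin{cases}
\tilde\gamma_{k+1}(\rho_k, N_k) & \text{if step 3a is visited,} \\
\tilde\gamma_{k+1}(\bar\rho, N_k) & \text{if step 3b is visited,} \\
\bar\gamma_k & \text{if step 3c is visited.}
\end{cases}
\]

Similar to Lemma 5.3.2, the following lemma shows the connection between $f(\cdot, \tilde\theta_{k+1})$
and $\tilde g_{k+1}(\cdot)$.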


Lemma 5.5.1 If assumptions A4' and A5 hold, then the parameter $\tilde\theta_{k+1}$ computed at step
4 of MRAS$_1$ satisfies

\[
E_{\tilde\theta_{k+1}}[\Gamma(X)] = E_{\tilde g_{k+1}}[\Gamma(X)], \quad \forall\, k = 0, 1, \ldots
\]

Note that the region $\{x : H(x) \ge \bar\gamma_{k+1}\}$ will become smaller and smaller as $\bar\gamma_{k+1}$
increases. Lemma 5.5.1 shows that the sequence of sampling p.d.f.'s/p.m.f.'s $\{f(\cdot, \tilde\theta_{k+1})\}$
is adapted to this sequence of shrinking regions. For example, consider the case where
$\{x : H(x) \ge \bar\gamma_{k+1}\}$ is convex and $\Gamma(x) = x$. Since $E_{\tilde g_{k+1}}[X]$ is the convex combination of
$X_1^k, \ldots, X_{N_k}^k$, the lemma implies that $E_{\tilde\theta_{k+1}}[X] \in \{x : H(x) \ge \bar\gamma_{k+1}\}$. Thus, it is natural
to expect that the random samples generated at the next iteration will fall in the region
$\{x : H(x) \ge \bar\gamma_{k+1}\}$ with large probability (e.g., consider the normal p.d.f., whose mode
is equal to its mean). In contrast, if we use a fixed sampling distribution for all iterations
as in pure random sampling (i.e., the $\lambda = 1$ case), then sampling from this sequence of
shrinking regions could become substantially more difficult in practice.

Next,  we  present  a  useful  lemma,  which  shows  the  convergence  of  the  quantile 
estimates  when  random  samples  are  generated  from  a  sequence  of  different  distributions. 

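Lemma 5.5.2 For any given $\rho^* \in (0, 1)$, let $\gamma_k^*$ be the set of $(1-\rho^*)$-quantiles of $H(X)$
with respect to the p.d.f./p.m.f. $\tilde f(\cdot, \tilde\theta_k)$, and let $\tilde\gamma_k^*(\rho^*, N_k)$ be the corresponding sample
quantile of $H(X_1^k), \ldots, H(X_{N_k}^k)$, where $\tilde f(\cdot, \tilde\theta_k)$ and $N_k$ are defined as in MRAS$_1$, and
$X_1^k, \ldots, X_{N_k}^k$ are i.i.d. with common density $\tilde f(\cdot, \tilde\theta_k)$. Then the distance from $\tilde\gamma_k^*(\rho^*, N_k)$
to $\gamma_k^*$ tends to zero as $k \to \infty$ w.p.1.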

Proof: Our proof is based on the proof of Lemma A1 in [69]. Notice that for given $\rho^*$
and $\tilde f(\cdot, \tilde\theta_k)$, $\gamma_k^*$ can be obtained as the optimal solution of the following problem (cf. [41])

\[
\min_{v \in V} \ell_k(v),
\]

where $V = [0, H(x^*)]$, $\ell_k(v) := \tilde E_{\tilde\theta_k}\,\phi(H(X), v)$, and

\[
\phi(H(x), v) :=
\begin{cases}
(1-\rho^*)\,(H(x) - v) & \text{if } v \le H(x), \\
\rho^*\,(v - H(x)) & \text{if } v \ge H(x).
\end{cases}
\tag{5.23}
\]

Similarly, the sample quantile $\tilde\gamma_k^*(\rho^*, N_k)$ can be expressed as the solution to the sample
average approximation of (5.23),

\[
\min_{v \in V} \bar\ell_k(v),
\tag{5.24}
\]

where $\bar\ell_k(v) := \frac{1}{N_k}\sum_{j=1}^{N_k} \phi(H(X_j^k), v)$, and $X_1^k, \ldots, X_{N_k}^k$ are i.i.d. with density $\tilde f(\cdot, \tilde\theta_k)$.
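
The link between the sample average problem (5.24) and the order-statistic definition of the sample quantile can be checked numerically. The Python sketch below is only an illustration: the function names are hypothetical, the restriction of $v$ to the interval $V$ is ignored, and the minimization is carried out over the sample points, which suffices because the empirical objective is piecewise linear and convex with kinks only at the sample values.

```python
import math


def check_loss(h, v, rho):
    """phi(h, v): (1 - rho) * (h - v) if v <= h, and rho * (v - h) otherwise."""
    return (1.0 - rho) * (h - v) if v <= h else rho * (v - h)


def empirical_objective(sample, v, rho):
    """Sample average of phi(H(X_j), v) over the sample."""
    return sum(check_loss(h, v, rho) for h in sample) / len(sample)


def quantile_by_minimization(sample, rho):
    """Minimize the empirical objective over the distinct sample values."""
    candidates = sorted(set(sample))
    return min(candidates, key=lambda v: empirical_objective(sample, v, rho))


def quantile_by_order_statistic(sample, rho):
    """ceil((1 - rho) * N)-th order statistic of the sample."""
    ordered = sorted(sample)
    return ordered[math.ceil((1.0 - rho) * len(ordered)) - 1]
```

When $(1-\rho^*)N_k$ is not an integer, the minimizer is unique and the two functions return the same value; when it is an integer, the set of minimizers is an interval between two adjacent order statistics.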

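Since the function $\phi(H(x), v)$ is bounded and continuous on $V$ for all $x \in \mathcal{X}$, it is
not difficult to show that $\ell_k(v)$ is continuous on $V$ (cf. [69]).

Now consider a point $v \in V$ and let $B_i \subset V$ be a sequence of open balls containing
$v$ such that $B_{i+1} \subseteq B_i$ $\forall\, i$ and $\lim_{i\to\infty} B_i = v$. Define the function

\[
b_i(H(x)) := \sup\{|\phi(H(x), u) - \phi(H(x), v)| : u \in B_i\}.
\]

We have from the dominated convergence theorem

\[
\lim_{i\to\infty} \tilde E_{\tilde\theta_k}[b_i(H(X))] = \tilde E_{\tilde\theta_k}\big[\lim_{i\to\infty} b_i(H(X))\big] = 0 \quad \forall\, k = 1, 2, \ldots,
\tag{5.25}
\]

where the last equality follows from the fact that $\phi(H(x), v)$ is continuous on $V$.

Since

\[
|\bar\ell_k(u) - \bar\ell_k(v)| \le \frac{1}{N_k}\sum_{j=1}^{N_k} |\phi(H(X_j^k), u) - \phi(H(X_j^k), v)|,
\]

it follows that

\[
\sup_{u \in B_i} |\bar\ell_k(u) - \bar\ell_k(v)| \le \frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)).
\tag{5.26}
\]

We now show that $\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) \to \tilde E_{\tilde\theta_k}[b_i(H(X))]$ as $k \to \infty$ w.p.1.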

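Let $M$ be an upper bound for $b_i(H(x))$, and let $T_\varepsilon := 2(H(x^*) - \mathcal{M})/\varepsilon$, where $\mathcal{M}$ is a
lower bound for the function $H(x)$, and $\varepsilon$ is defined as in the MRAS$_1$ algorithm. Note
that the total number of visits to steps 3a and 3b of MRAS$_1$ is bounded by $T_\varepsilon$; thus for any
$k > T_\varepsilon$, the total number of visits to step 3c is greater than $k - T_\varepsilon$. Since conditional on $\tilde\theta_k$,
$\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k))$ is an unbiased estimate of $\tilde E_{\tilde\theta_k}[b_i(H(X))]$, by the Hoeffding inequality
([40]), for any $\zeta > 0$,

\[
P\Big(\Big|\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) - \tilde E_{\tilde\theta_k}[b_i(H(X))]\Big| \ge \zeta \,\Big|\, \tilde\theta_k = \theta\Big)
\le 2\exp\Big(\frac{-2 N_k \zeta^2}{M^2}\Big) \quad \forall\, k.
\]

Therefore

\[
\begin{aligned}
P\Big(\Big|\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) - \tilde E_{\tilde\theta_k}[b_i(H(X))]\Big| \ge \zeta\Big)
&\le 2\exp\Big(\frac{-2 N_k \zeta^2}{M^2}\Big) \quad \forall\, k, \\
&\le 2\exp\Big(\frac{-2\alpha^{k-T_\varepsilon} N_0 \zeta^2}{M^2}\Big) \quad \forall\, k \ge T_\varepsilon, \\
&\to 0 \ \text{as } k \to \infty, \ \text{since } \alpha > 1.
\end{aligned}
\]

Furthermore, it is easy to see that

\[
\sum_{k=1}^{\infty} P\Big(\Big|\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) - \tilde E_{\tilde\theta_k}[b_i(H(X))]\Big| \ge \zeta\Big)
\le 2\sum_{k=1}^{\infty} \exp\Big(\frac{-2\alpha^{k-T_\varepsilon} N_0 \zeta^2}{M^2}\Big) < \infty.
\]

By the Borel-Cantelli lemma,

\[
P\Big(\Big|\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) - \tilde E_{\tilde\theta_k}[b_i(H(X))]\Big| \ge \zeta \ \text{i.o.}\Big) = 0.
\]

This implies that $\frac{1}{N_k}\sum_{j=1}^{N_k} b_i(H(X_j^k)) \to \tilde E_{\tilde\theta_k}[b_i(H(X))]$ as $k \to \infty$ w.p.1. Note that by
using a similar argument as above, we can also show that $\bar\ell_k(v) \to \ell_k(v)$ w.p.1 as $k \to \infty$.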
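
To see why a geometrically growing sample size makes the Hoeffding tail bounds summable, which is what the Borel-Cantelli step requires, the following Python sketch evaluates a bound of the form $2\exp(-2\alpha^{k-T} N_0 \zeta^2 / M^2)$ and its partial sums. The numerical values used in the usage note are arbitrary illustrative choices, and the function and parameter names are hypothetical.

```python
import math


def hoeffding_bound(k, n0, alpha, zeta, m_upper, t_eps):
    """Evaluate 2 * exp(-2 * alpha**(k - t_eps) * n0 * zeta**2 / m_upper**2)."""
    growth = alpha ** (k - t_eps)            # lower bound on N_k / N_0 after t_eps iterations
    return 2.0 * math.exp(-2.0 * growth * n0 * zeta ** 2 / m_upper ** 2)


def partial_sum(k_max, **params):
    """Sum of the bounds for k = 1, ..., k_max."""
    return sum(hoeffding_bound(k, **params) for k in range(1, k_max + 1))
```

For example, with `n0=100`, `alpha=1.5`, `zeta=0.1`, `m_upper=1.0`, and `t_eps=10`, the bounds are non-increasing in `k` and the partial sums stop changing (to machine precision) well before `k = 60`, consistent with a convergent series. With `alpha = 1` the bound would be constant in `k` and the series would diverge, which is why the argument needs $\alpha > 1$.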

The above result together with (5.25) and (5.26) implies that for any $\delta > 0$, there
exists a small neighborhood $B_v$ of $v$ such that

\[
\sup\{|\bar\ell_k(u) - \bar\ell_k(v)| : u \in B_v\} < \delta \quad \text{w.p.1 for } k \text{ sufficiently large.}
\]

Since this holds for all $v \in V$, we have $V \subseteq \cup_{v \in V} B_v$, and because $V$ is compact, there
exists a finite subcover $B_{v_1}, \ldots, B_{v_m}$ such that

\[
\sup\{|\bar\ell_k(u) - \bar\ell_k(v_j)| : u \in B_{v_j}\} < \delta \quad \text{w.p.1 for } k \text{ sufficiently large, and } V \subseteq \cup_{j=1}^{m} B_{v_j}.
\]

Furthermore, by the continuity of $\ell_k(v)$, these open balls can be chosen in such a way that

\[
\sup\{|\ell_k(u) - \ell_k(v_j)| : u \in B_{v_j}\} < \delta \quad \forall\, j = 1, \ldots, m.
\]

Since $\bar\ell_k(v_j) \to \ell_k(v_j)$ w.p.1 as $k \to \infty$ for all $j = 1, \ldots, m$,

\[
|\bar\ell_k(v_j) - \ell_k(v_j)| < \delta \quad \text{w.p.1 for } k \text{ sufficiently large, } \forall\, j = 1, \ldots, m.
\]

For any $v \in V$, without loss of generality assume $v \in B_{v_j}$; we have w.p.1 for $k$ sufficiently
large

\[
|\bar\ell_k(v) - \ell_k(v)| \le |\bar\ell_k(v) - \bar\ell_k(v_j)| + |\bar\ell_k(v_j) - \ell_k(v_j)| + |\ell_k(v_j) - \ell_k(v)| < 3\delta,
\]

which implies that $\bar\ell_k(v) \to \ell_k(v)$ uniformly w.p.1 on $V$.

The rest of the proof follows from Theorem A1 in [69] (p. 69), which basically
states that if $\bar\ell_k(v) \to \ell_k(v)$ uniformly w.p.1, then the distance from $\tilde\gamma_k^*(\rho^*, N_k)$ to $\gamma_k^*$ tends
to zero w.p.1 as $k \to \infty$. ∎

We  are  now  ready  to  state  the  main  theorem. 

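Theorem 5.5.1 Let $\varepsilon > 0$, and define the $\varepsilon$-optimal set $\mathcal{O}_\varepsilon := \{x : H(x) \ge H(x^*) - \varepsilon\} \cap \mathcal{X}$.
If assumptions A1, A3', A4', and A5 are satisfied, then there exists a random variable $\mathcal{K}$
such that w.p.1, $\mathcal{K} < \infty$, and

1. $\bar\gamma_k > H(x^*) - \varepsilon$, $\forall\, k \ge \mathcal{K}$.

2. $E_{\tilde\theta_{k+1}}[\Gamma(X)] \in CONV\{\Gamma(\mathcal{O}_\varepsilon)\}$, $\forall\, k \ge \mathcal{K}$, where $CONV\{\Gamma(\mathcal{O}_\varepsilon)\}$ indicates the
convex hull of the set $\Gamma(\mathcal{O}_\varepsilon)$.

Furthermore, let $\beta$ be a positive constant satisfying the condition that the set $\{x : S(H(x)) \ge 1/\beta\}$
has a strictly positive Lebesgue/counting measure. If assumptions A1, A2, A3', A4',
and A5 are all satisfied and $\alpha > (\beta S^*)^2$, where $S^* := S(H(x^*))$, then

3. $\lim_{k\to\infty} E_{\tilde\theta_k}[\Gamma(X)] = \Gamma(x^*)$ w.p.1.

Remark 5.5.1 Roughly speaking, the second result can be understood as finite time
$\varepsilon$-optimality. To see this, consider the special case where $H(x)$ is locally concave on the
set $\mathcal{O}_\varepsilon$. Let $x, y \in \mathcal{O}_\varepsilon$ and $\eta \in [0, 1]$ be arbitrary. By the definition of concavity, we will
have $H(\eta x + (1-\eta) y) \ge \eta H(x) + (1-\eta) H(y) \ge H(x^*) - \varepsilon$, which implies that the set
$\mathcal{O}_\varepsilon$ is convex. If in addition $\Gamma(x)$ is also convex and one-to-one on $\mathcal{O}_\varepsilon$ (e.g., multivariate
normal p.d.f.), then $CONV\{\Gamma(\mathcal{O}_\varepsilon)\} = \Gamma(\mathcal{O}_\varepsilon)$. Thus it follows that
$\Gamma^{-1}\big(E_{\tilde\theta_{k+1}}[\Gamma(X)]\big) \in \mathcal{O}_\varepsilon$, $\forall\, k \ge \mathcal{K}$ w.p.1.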

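Proof of Theorem 5.5.1: (1) The first part of the proof is an extension of the proofs
given in [41]. First we claim that given $\rho_k$ and $\bar\gamma_k$, if $\bar\gamma_k \le H(x^*) - \varepsilon$, then $\exists\, \bar{\mathcal{K}} < \infty$ w.p.1
and $\bar\rho \in (0, \rho_k)$ such that $\tilde\gamma_{k'+1}(\bar\rho, N_{k'}) \ge \bar\gamma_k + \frac{\varepsilon}{2}$ $\forall\, k' \ge \bar{\mathcal{K}}$. To show this, we proceed by
contradiction.

Let $\rho_k^* := \tilde P_{\tilde\theta_k}\big(H(X) \ge \bar\gamma_k + \frac{\varepsilon}{2}\big)$. If $\bar\gamma_k \le H(x^*) - \varepsilon$, then $\bar\gamma_k + \frac{\varepsilon}{2} \le H(x^*) - \frac{\varepsilon}{2}$. By
A1 and A3', we have

\[
\rho_k^* \ge \tilde P_{\tilde\theta_k}\big(H(X) \ge H(x^*) - \tfrac{\varepsilon}{2}\big) \ge \lambda\, C(\varepsilon, \theta_0) > 0,
\tag{5.27}
\]

where $C(\varepsilon, \theta_0) = \int_{\mathcal{X}} I_{\{H(x) \ge H(x^*) - \varepsilon/2\}}\, f(x, \theta_0)\, \nu(dx)$ is a constant.

Now assume that $\exists\, \rho \in (0, \rho_k^*)$ such that $\gamma_{k+1}(\rho, \tilde\theta_k) < \bar\gamma_k + \frac{\varepsilon}{2}$, where $\gamma_{k+1}(\rho, \tilde\theta_k)$ is
the $(1-\rho)$-quantile of $H(X)$ with respect to $\tilde f(\cdot, \tilde\theta_k)$. By the definition of quantiles, we
have

\[
\tilde P_{\tilde\theta_k}\big(H(X) \ge \gamma_{k+1}(\rho, \tilde\theta_k)\big) \ge \rho, \quad \text{and} \quad
\tilde P_{\tilde\theta_k}\big(H(X) \le \gamma_{k+1}(\rho, \tilde\theta_k)\big) \ge 1 - \rho > 1 - \rho_k^*.
\tag{5.28}
\]

It follows that $\tilde P_{\tilde\theta_k}\big(H(X) \le \gamma_{k+1}(\rho, \tilde\theta_k)\big) \le \tilde P_{\tilde\theta_k}\big(H(X) < \bar\gamma_k + \frac{\varepsilon}{2}\big) = 1 - \rho_k^*$ by the
definition of $\rho_k^*$, which contradicts equation (5.28); thus we must have that if $\bar\gamma_k \le H(x^*) - \varepsilon$,
then

\[
\gamma_{k+1}(\rho, \tilde\theta_k) \ge \bar\gamma_k + \frac{\varepsilon}{2}, \quad \forall\, \rho \in (0, \rho_k^*).
\]

Therefore by (5.27), $\exists\, \bar\rho \in (0, \min\{\rho_k, \lambda C(\varepsilon, \theta_0)\}) \subseteq (0, \rho_k^*)$ such that $\gamma_{k+1}(\bar\rho, \tilde\theta_k) \ge \bar\gamma_k + \frac{\varepsilon}{2}$
whenever $\bar\gamma_k \le H(x^*) - \varepsilon$. By Lemma 5.5.2, the distance from the sample $(1-\bar\rho)$-quantile
$\tilde\gamma_{k+1}(\bar\rho, N_k)$ to the set of $(1-\bar\rho)$-quantiles $\gamma_{k+1}(\bar\rho, \tilde\theta_k)$ goes to zero as $k \to \infty$ w.p.1; thus
$\exists\, \bar{\mathcal{K}} < \infty$ w.p.1 such that $\tilde\gamma_{k'+1}(\bar\rho, N_{k'}) \ge \bar\gamma_k + \frac{\varepsilon}{2}$ $\forall\, k' \ge \bar{\mathcal{K}}$.

Notice that from the MRAS$_1$ algorithm, if neither step 3a nor 3b is visited at the
$k$th iteration, we will have $\rho_{k+1} = \rho_k$ and $\bar\gamma_{k+1} = \bar\gamma_k$. Thus, whenever $\bar\gamma_k \le H(x^*) - \varepsilon$,
w.p.1 step 3a/3b will be visited after a finite number of iterations. Furthermore, since
the total number of visits to steps 3a and 3b is finite (i.e., bounded by $2(H(x^*) - \mathcal{M})/\varepsilon$, where
recall that $\mathcal{M}$ is a lower bound for $H(x)$), we conclude that there exists $\mathcal{K} < \infty$ w.p.1
such that

\[
\bar\gamma_k > H(x^*) - \varepsilon, \quad \forall\, k \ge \mathcal{K} \ \text{w.p.1.}
\]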

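(2) From the MRAS$_1$ algorithm, it is easy to see that $\bar\gamma_{k+1} \ge \bar\gamma_k$, $\forall\, k = 0, 1, \ldots$. By
part (1), we have $\bar\gamma_{k+1} > H(x^*) - \varepsilon$, $\forall\, k \ge \mathcal{K}$ w.p.1. Thus, by the definition of $\tilde g_{k+1}(x)$
(cf. (5.22)), it follows immediately that if $\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\} \ne \emptyset$,
then the support of $\tilde g_{k+1}(x)$ satisfies $supp\{\tilde g_{k+1}\} \subseteq \mathcal{O}_\varepsilon$ $\forall\, k \ge \mathcal{K}$ w.p.1; otherwise if
$\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\} = \emptyset$, then $supp\{\tilde g_{k+1}\} = \emptyset$. We now discuss these
two cases separately.

Case 1. If $supp\{\tilde g_{k+1}\} \subseteq \mathcal{O}_\varepsilon$, then we have $\{\Gamma(supp\{\tilde g_{k+1}\})\} \subseteq \{\Gamma(\mathcal{O}_\varepsilon)\}$. Since
$E_{\tilde g_{k+1}}[\Gamma(X)]$ is the convex combination of $\Gamma(X_1^k), \ldots, \Gamma(X_{N_k}^k)$, it follows that

\[
E_{\tilde g_{k+1}}[\Gamma(X)] \in CONV\{\Gamma(supp\{\tilde g_{k+1}\})\} \subseteq CONV\{\Gamma(\mathcal{O}_\varepsilon)\}.
\]

Thus by A4', A5, and Lemma 5.5.1,

\[
E_{\tilde\theta_{k+1}}[\Gamma(X)] \in CONV\{\Gamma(\mathcal{O}_\varepsilon)\}.
\]

Case 2. If $supp\{\tilde g_{k+1}\} = \emptyset$ (note that this could only happen if step 3c is visited), then
from the algorithm, there exists some $\hat k < k + 1$ such that $\bar\gamma_{k+1} = \bar\gamma_{\hat k}$ and $supp\{\tilde g_{\hat k}\} \ne \emptyset$.
Without loss of generality, let $\hat k$ be the largest iteration counter such that the preceding
properties hold. Since $\bar\gamma_{\hat k} = \bar\gamma_{k+1} > H(x^*) - \varepsilon$ $\forall\, k \ge \mathcal{K}$ w.p.1, we have $supp\{\tilde g_{\hat k}\} \subseteq \mathcal{O}_\varepsilon$
w.p.1. By following the discussions in Case 1, it is clear that

\[
E_{\tilde\theta_{\hat k}}[\Gamma(X)] \in CONV\{\Gamma(\mathcal{O}_\varepsilon)\} \quad \text{w.p.1.}
\]

Furthermore, since $\tilde\theta_{\hat k} = \tilde\theta_{\hat k + 1} = \cdots = \tilde\theta_{k+1}$ (see discussions in Chapter 5.5.1), we will
again have

\[
E_{\tilde\theta_{k+1}}[\Gamma(X)] \in CONV\{\Gamma(\mathcal{O}_\varepsilon)\}, \quad \forall\, k \ge \mathcal{K} \ \text{w.p.1.}
\]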

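(3) Define $\hat g_{k+1}(x)$ as

\[
\hat g_{k+1}(x) := \frac{[S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}}{\int_{\mathcal{X}} [S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}, \quad \forall\, k = 1, 2, \ldots,
\]

where $\bar\gamma_k$ is defined as in MRAS$_1$. Note that since $\bar\gamma_k$ is a random variable, $\hat g_{k+1}(x)$ is also
a random variable. It follows that

\[
E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \Gamma(x)\, \nu(dx)}{\int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)}.
\]

Let $\omega = (X_1^0, \ldots, X_{N_0}^0, X_1^1, \ldots, X_{N_1}^1, \ldots)$ be a particular sample path generated by
the algorithm. For each $\omega$, the sequence $\{\bar\gamma_k(\omega),\ k = 1, 2, \ldots\}$ is non-decreasing and each
strict increase is lower bounded by $\varepsilon/2$. Thus, $\exists\, \mathcal{N}(\omega) > 0$ such that $\bar\gamma_{k+1}(\omega) = \bar\gamma_k(\omega)$ $\forall\, k \ge
\mathcal{N}(\omega)$. Now define $\Omega_1 := \{\omega : \lim_{k\to\infty} \bar\gamma_k(\omega) = H(x^*)\}$. By the definition of $\tilde g_{k+1}(\cdot)$ (cf.
(5.22)), for each $\omega \in \Omega_1$ we clearly have $\lim_{k\to\infty} E_{\tilde g_{k+1}}[\Gamma(X)] = \Gamma(x^*)$; thus, it follows
from Lemma 5.5.1 that $\lim_{k\to\infty} E_{\tilde\theta_{k+1}}[\Gamma(X)] = \Gamma(x^*)$, $\forall\, \omega \in \Omega_1$. The rest of the proof
amounts to showing that the result also holds almost surely (a.s.) on the set $\Omega_1^c$.

Since $\lim_{k\to\infty} \bar\gamma_k(\omega) < H(x^*)$ $\forall\, \omega \in \Omega_1^c$, we have by Fatou's lemma

\[
\liminf_{k\to\infty} \int_{\mathcal{X}} [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx)
\ge \int_{\mathcal{X}} \liminf_{k\to\infty}\, [\beta S(H(x))]^k\, I_{\{H(x) \ge \bar\gamma_k\}}\, \nu(dx) > 0, \quad \forall\, \omega \in \Omega_1^c,
\tag{5.29}
\]

where the last inequality follows from the fact that $\beta S(H(x)) \ge 1$ $\forall\, x \in \{x : H(x) \ge
\max\{S^{-1}(\frac{1}{\beta}), \bar\gamma_{\mathcal{N}(\omega)}\}\}$ and assumption A1.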

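Since $f(x, \theta_0) > 0$ $\forall\, x \in \mathcal{X}$, we have $\mathcal{X} \subseteq supp\{\tilde f(\cdot, \tilde\theta_k)\}$ $\forall\, k$; thus

\[
E_{\hat g_{k+1}}[\Gamma(X)] = \frac{\tilde E_{\tilde\theta_k}\big[\beta^k \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\, \Gamma(X)\big]}{\tilde E_{\tilde\theta_k}\big[\beta^k \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\big]},
\]

where $\tilde S_k(H(x)) := [S(H(x))]^k / \tilde f(x, \tilde\theta_k)$. We now show that $E_{\tilde g_{k+1}}[\Gamma(X)] \to E_{\hat g_{k+1}}[\Gamma(X)]$
a.s. on $\Omega_1^c$ as $k \to \infty$. Since we are only interested in the limiting behavior of $E_{\tilde g_{k+1}}[\Gamma(X)]$,
it is sufficient to show that

\[
\frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \beta^k \tilde S_k(H(X_i^k))\, I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}}\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \beta^k \tilde S_k(H(X_i^k))\, I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}}}
\to E_{\hat g_{k+1}}[\Gamma(X)] \quad \text{a.s. on } \Omega_1^c,
\]

where and hereafter, whenever $\{x : H(x) \ge \bar\gamma_{k+1},\ x \in \{X_1^k, \ldots, X_{N_k}^k\}\} = \emptyset$, we define
$\frac{0}{0} = 0$.

For brevity, we use the following shorthand notations:

\[
\hat Y^k := \tilde E_{\tilde\theta_k}\big[\beta^k \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\big], \qquad
\hat Y_\Gamma^k := \tilde E_{\tilde\theta_k}\big[\beta^k \tilde S_k(H(X))\, I_{\{H(X) \ge \bar\gamma_k\}}\, \Gamma(X)\big],
\]
\[
\tilde Y_i^k := \beta^k \tilde S_k(H(X_i^k))\, I_{\{H(X_i^k) \ge \bar\gamma_{k+1}\}}, \qquad
\hat Y_i^k := \beta^k \tilde S_k(H(X_i^k))\, I_{\{H(X_i^k) \ge \bar\gamma_k\}}.
\]

We also let $T_\varepsilon := 2(H(x^*) - \mathcal{M})/\varepsilon$. Note that the total number of visits to steps 3a and 3b
of MRAS$_1$ is bounded by $T_\varepsilon$; thus for any $k > T_\varepsilon$, the total number of visits to step 3c is
greater than $k - T_\varepsilon$.

We have

\[
E_{\tilde g_{k+1}}[\Gamma(X)] - E_{\hat g_{k+1}}[\Gamma(X)] =
\Bigg[\frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \tilde Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \tilde Y_i^k}
 - \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k}\Bigg]
+ \Bigg[\frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k}
 - \frac{\hat Y_\Gamma^k}{\hat Y^k}\Bigg].
\]

Since for each $\omega \in \Omega_1^c$, $\bar\gamma_{k+1}(\omega) = \bar\gamma_k(\omega)$ $\forall\, k \ge \mathcal{N}(\omega)$, it is straightforward to see that the
first term

\[
\frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \tilde Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \tilde Y_i^k}
 - \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k} = 0, \quad \forall\, k \ge \mathcal{N}(\omega),\ \forall\, \omega \in \Omega_1^c.
\tag{5.30}
\]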

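To show that the second term also converges to zero, we denote by $\mathcal{V}_k$ the event $\mathcal{V}_k =
\{\bar\gamma_k > H(x^*) - \varepsilon\}$. For any $\zeta > 0$, we also let $\mathcal{C}_k$ be the event
$\mathcal{C}_k = \{|\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k - \hat Y^k| \ge \zeta\}$.

We have

\[
\begin{aligned}
P(\mathcal{C}_k \ \text{i.o.}) &= P(\{\mathcal{C}_k \cap \mathcal{V}_k\} \cup \{\mathcal{C}_k \cap \mathcal{V}_k^c\} \ \text{i.o.}) \\
&= P(\mathcal{C}_k \cap \mathcal{V}_k \ \text{i.o.}), \quad \text{since } P(\mathcal{V}_k^c \ \text{i.o.}) = 0 \text{ by part (1).}
\end{aligned}
\tag{5.31}
\]

It is easy to see that conditional on $\tilde\theta_k$ and $\bar\gamma_k$, $\hat Y_1^k, \ldots, \hat Y_{N_k}^k$ are i.i.d. and
$E[\hat Y_i^k \,|\, \tilde\theta_k, \bar\gamma_k] = \hat Y^k$ $\forall\, i$. Furthermore, by assumption A3', conditional on the event $\mathcal{V}_k$,
the support $[a_k, b_k]$ of the random variable $\hat Y_i^k$ satisfies $[a_k, b_k] \subseteq \big[0, \frac{(\beta S^*)^k}{\lambda f_*}\big]$.
Therefore, we have from the Hoeffding inequality ([40]),

\[
\begin{aligned}
P(\mathcal{C}_k \,|\, \mathcal{V}_k, \tilde\theta_k = \theta, \bar\gamma_k = \gamma)
&= P\Big(\Big|\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k - \hat Y^k\Big| \ge \zeta \,\Big|\, \mathcal{V}_k, \tilde\theta_k = \theta, \bar\gamma_k = \gamma\Big) \\
&\le 2\exp\Big(\frac{-2 N_k \zeta^2}{(b_k - a_k)^2}\Big) \\
&\le 2\exp\Big(\frac{-2 N_k \zeta^2 [\lambda f_*]^2}{(\beta S^*)^{2k}}\Big) \quad \forall\, k = 1, 2, \ldots
\end{aligned}
\tag{5.32}
\]

Since

\[
\begin{aligned}
P(\mathcal{C}_k \cap \mathcal{V}_k) &= \int_{\theta, \gamma} P(\mathcal{C}_k \cap \mathcal{V}_k \,|\, \tilde\theta_k = \theta, \bar\gamma_k = \gamma)\, f_{\tilde\theta_k, \bar\gamma_k}(d\theta, d\gamma) \\
&= \int_{\theta, \gamma \in \mathcal{V}_k} P(\mathcal{C}_k \,|\, \mathcal{V}_k, \tilde\theta_k = \theta, \bar\gamma_k = \gamma)\, f_{\tilde\theta_k, \bar\gamma_k}(d\theta, d\gamma),
\end{aligned}
\]

where $f_{\tilde\theta_k, \bar\gamma_k}(\cdot, \cdot)$ is the joint distribution of random variables $\tilde\theta_k$ and $\bar\gamma_k$, we have by (5.32),

\[
\begin{aligned}
P(\mathcal{C}_k \cap \mathcal{V}_k) &\le 2\exp\Big(\frac{-2 N_k \zeta^2 [\lambda f_*]^2}{(\beta S^*)^{2k}}\Big) \\
&\le 2\exp\Big(\frac{-2(\alpha^{k - T_\varepsilon} N_0)\, \zeta^2 [\lambda f_*]^2}{(\beta S^*)^{2k}}\Big) \quad \forall\, k \ge T_\varepsilon, \\
&= 2\exp\Big(\frac{-2 N_0 \zeta^2 \lambda^2 f_*^2}{\alpha^{T_\varepsilon}} \Big(\frac{\alpha}{(\beta S^*)^2}\Big)^k\Big).
\end{aligned}
\]

Since $\alpha / (\beta S^*)^2 > 1$ (by assumption), it follows that

\[
\lim_{k\to\infty} P(\mathcal{C}_k \cap \mathcal{V}_k) = 0.
\]

Furthermore, since $e^{-x} < 1/x$ $\forall\, x > 0$, we have

\[
P(\mathcal{C}_k \cap \mathcal{V}_k) < \frac{\alpha^{T_\varepsilon}}{N_0 \zeta^2 \lambda^2 f_*^2} \Big(\frac{(\beta S^*)^2}{\alpha}\Big)^k \quad \forall\, k \ge T_\varepsilon,
\]

and because $(\beta S^*)^2 / \alpha < 1$, we have

\[
\sum_{k=0}^{\infty} P(\mathcal{C}_k \cap \mathcal{V}_k) \le T_\varepsilon + \frac{\alpha^{T_\varepsilon}}{N_0 \zeta^2 \lambda^2 f_*^2} \sum_{k=0}^{\infty} \Big(\frac{(\beta S^*)^2}{\alpha}\Big)^k < \infty.
\]

Finally by the Borel-Cantelli lemma and (5.31),

\[
P(\mathcal{C}_k \ \text{i.o.}) = P(\mathcal{C}_k \cap \mathcal{V}_k \ \text{i.o.}) = 0.
\]

Since this holds for any $\zeta > 0$, we have $\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k \to \hat Y^k$ w.p.1.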


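By following the same argument as before, we can also show that $\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k\, \Gamma(X_i^k) \to
\hat Y_\Gamma^k$ w.p.1. And since $\liminf_{k\to\infty} \hat Y^k > 0$ $\forall\, \omega \in \Omega_1^c$ (i.e., (5.29)), we have

\[
\frac{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k\, \Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} \hat Y_i^k} \to \frac{\hat Y_\Gamma^k}{\hat Y^k} \quad \text{as } k \to \infty \ \text{a.s. on } \Omega_1^c.
\]

By the definition of $\tilde g_{k+1}(\cdot)$, the above result together with (5.30) suggests that

\[
E_{\tilde g_{k+1}}[\Gamma(X)] \to E_{\hat g_{k+1}}[\Gamma(X)] \quad \text{as } k \to \infty \ \text{a.s. on } \Omega_1^c.
\]

Thus, in conclusion, we have

\[
E_{\tilde g_{k+1}}[\Gamma(X)] \to E_{\hat g_{k+1}}[\Gamma(X)] \quad \text{as } k \to \infty \ \text{w.p.1.}
\]

On the other hand, by A1, A2, and following the proof of Theorem 5.3.1, it is not
difficult to show that

\[
E_{\hat g_{k+1}}[\Gamma(X)] \to \Gamma(x^*) \quad \text{as } k \to \infty \ \text{w.p.1.}
\]

Hence by Lemma 5.5.1, we have

\[
\lim_{k\to\infty} E_{\tilde\theta_{k+1}}[\Gamma(X)] = \lim_{k\to\infty} E_{\tilde g_{k+1}}[\Gamma(X)] = \Gamma(x^*) \quad \text{w.p.1.} \qquad ∎
\]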

The  following  results  are  now  immediate. 

Corollary 5.5.2 (Multivariate Normal) For continuous optimization problems in $\Re^n$,
if multivariate normal p.d.f.'s are used in MRAS$_1$, i.e.,

\[
f(x, \tilde\theta_k) = \frac{1}{\sqrt{(2\pi)^n |\tilde\Sigma_k|}} \exp\Big(-\frac{1}{2}(x - \tilde\mu_k)^T \tilde\Sigma_k^{-1} (x - \tilde\mu_k)\Big),
\]

$\varepsilon > 0$, $\alpha > (\beta S^*)^2$, and assumptions A1, A2, A3', and A4' are satisfied, then

\[
\lim_{k\to\infty} \tilde\mu_k = x^*, \quad \text{and} \quad \lim_{k\to\infty} \tilde\Sigma_k = 0_{n \times n} \quad \text{w.p.1.}
\]

Corollary 5.5.3 (Independent Univariate) If the components of the random vector
$X = (X_1, X_2, \ldots, X_n)$ are independent, each with a univariate p.d.f./p.m.f. of the form

\[
f(x_i, \vartheta_i) = \exp(x_i \vartheta_i - K(\vartheta_i))\, h(x_i), \quad \vartheta_i \in \Re,\ \forall\, i = 1, \ldots, n,
\]

$\varepsilon > 0$, $\alpha > (\beta S^*)^2$, and assumptions A1, A2, A3', A4', and A5 are satisfied, then

\[
\lim_{k\to\infty} E_{\tilde\theta_k}[X] = x^* \quad \text{w.p.1, where } \tilde\theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k).
\]

5.6  Numerical  Examples 

In this Chapter, we illustrate the performance of the MRAS method for both continuous
and combinatorial optimization problems. In the former case, we test the algorithm
on various functions that are well-known in global optimization and compare its performance
with that of the standard CE method. In the latter case, we apply the algorithm
to several Asymmetric Traveling Salesman Problems (ATSP), which are typical representatives
of NP-hard combinatorial optimization problems.

Remark 5.6.1 It is not our primary intention here to compare our algorithm with the
CE method and EDAs. A comprehensive comparison of different methods is beyond the
scope of this research. Our main goal here is to propose a novel algorithm with provable
convergence, and show that the algorithm is promising in solving some difficult optimization
problems. The performance of the CE method on continuous functions can be found
in, e.g., [51], [66]. Its performance on various ATSP instances can be found in, e.g., [26],
[67].

We now discuss some implementation issues of the MRAS$_1$ algorithm.

1. Since all examples considered in this Chapter are minimization problems, whereas
MRAS was presented in a maximization context, the following modifications are
required:

• $S(\cdot)$ needs to be initialized as a strictly decreasing function instead of strictly
increasing. Throughout this Chapter, we take

$S(H(x)) := \exp\{-rH(x)\}$, where $r$ is a positive constant.

• The sample $(1-\rho_k)$-quantile $\tilde\gamma_{k+1}$ will now be calculated by first ordering the
sample performances $H(X_i^k)$, $i = 1, \ldots, N_k$, from largest to smallest, and then
taking the $\lceil (1-\rho_k)N_k \rceil$th order statistic.

• We need to replace the “$\ge$” operator with the “$\le$” operator in equation (5.21).

• The inequalities at step 3 need to be replaced with

$\tilde\gamma_{k+1}(\rho_k, N_k) \le \bar\gamma_k - \frac{\varepsilon}{2}$ and $\tilde\gamma_{k+1}(\bar\rho, N_k) \le \bar\gamma_k - \frac{\varepsilon}{2}$,

respectively.

2. Similar to CE, a smoothed parameter updating procedure (cf. e.g., [26], [66]) is used
in actual implementation, i.e., first a smoothed parameter vector $\hat\theta_{k+1}$ is computed
at each iteration $k$ according to

$\hat\theta_{k+1} := v\,\tilde\theta_{k+1} + (1 - v)\,\hat\theta_k$, $\forall\, k = 0, 1, \ldots$, and $\hat\theta_0 := \tilde\theta_0$,

where $\tilde\theta_{k+1}$ is the parameter vector computed at step 4 of MRAS$_1$, and $v \in (0, 1]$ is
the smoothing parameter; then $f(x, \hat\theta_{k+1})$ (instead of $f(x, \tilde\theta_{k+1})$) is used in step 1 to
generate new samples. It is important to note that this modification will not affect
the theoretical convergence of our approach.

3. In practice, different stopping criteria can be used. The simplest method is to stop
the algorithm when a predefined maximum number of iterations is reached, or when
the total computational budget is exhausted. In the numerical experiments, a mixed
stopping rule is used: we stop the algorithm either when no significant improvement
in $\bar\gamma_k$ is obtained for several consecutive iterations or when the sample size at a single
iteration exceeds some predefined threshold, i.e., as soon as either one of the following
two conditions is satisfied at iteration $k$:

(1) $\max_{1 \le i \le d} |\bar\gamma_k - \bar\gamma_{k+i}| \le \tau$;

(2) $N_k > N_{\max}$;

where $\tau > 0$ is a predefined tolerance level, $d$ is a positive integer, and $N_{\max}$ is the
maximum number of samples allowed per iteration.

4. Another practical issue is that in order to obtain a valid estimate $\tilde\theta_{k+1}$ at each
iteration of MRAS$_1$, we must make sure that enough samples are used in parameter
updating. This can be achieved by using an additional parameter $N_{\min}$, and
performing the update (5.21) only when the number of the elite samples (i.e., those
samples having performances better than the threshold $\bar\gamma_{k+1}$) is greater than $N_{\min}$;
in effect, this is equivalent to searching $\bar\rho$ from $(\rho_{\min}, \rho_k)$ instead of $(0, \rho_k)$ at step 3
of MRAS$_1$, where $\rho_{\min} := N_{\min}/N_k \to 0$ as $k \to \infty$.
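
The smoothed update of item 2 and the mixed stopping rule of item 3 can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, parameter vectors are represented as flat lists of floats, and the stopping test is applied to the history of thresholds recorded so far, so that the first condition is checked on the last $d + 1$ recorded values.

```python
def smooth_parameters(theta_new, theta_smoothed_prev, v):
    """Return v * theta_new + (1 - v) * theta_smoothed_prev, componentwise."""
    if not 0.0 < v <= 1.0:
        raise ValueError("smoothing parameter v must lie in (0, 1]")
    return [v * a + (1.0 - v) * b for a, b in zip(theta_new, theta_smoothed_prev)]


def should_stop(thresholds, n_k, tau, d, n_max):
    """Mixed stopping rule.

    thresholds: history of threshold values, most recent last.
    n_k: sample size used at the current iteration.
    """
    if n_k > n_max:
        return True                      # condition (2): sample size too large
    if len(thresholds) < d + 1:
        return False                     # not enough history for condition (1)
    base = thresholds[-(d + 1)]
    # condition (1): the d most recent thresholds stay within tau of the base value
    return max(abs(base - g) for g in thresholds[-d:]) <= tau
```

With `v = 1` the smoothed vector coincides with the newly computed parameter vector, so the unsmoothed algorithm is recovered as a special case.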

5.6.1  Continuous  Optimization 

In our preliminary experiments, we take the family of parameterized p.d.f.'s to be
multivariate normal p.d.f.'s. Initially, a mean vector $\tilde\mu_0$ and a covariance matrix $\tilde\Sigma_0$ are
specified; then at each iteration $k$ of the algorithm, new parameters $\tilde\mu_{k+1}$ and $\tilde\Sigma_{k+1}$ are
updated according to the respective stochastic counterparts of equations (5.16) and (5.17).
By Corollary 5.5.2, the sequence of mean vectors $\{\tilde\mu_k\}$ will converge to the optimal solution
$x^*$, and the sequence of covariance matrices $\{\tilde\Sigma_k\}$ to the zero matrix. Throughout this
Chapter, we will use $\tilde\mu_k$ to represent the current best solution found at iteration $k$.
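
As a rough illustration of what a sample-based update of the normal parameters can look like, the Python sketch below computes a weighted sample mean and a weighted sample covariance centered at that mean. It assumes, without reproducing equations (5.16) and (5.17), that the stochastic counterparts take this weighted-average form; the weights are left as inputs (in MRAS$_1$ they would come from the performance function, the sampling density, and the elite-sample indicator), and the function name is hypothetical.

```python
def weighted_mean_and_cov(samples, weights):
    """Weighted mean vector and covariance matrix of a list of points.

    samples: list of equal-length lists of floats.
    weights: list of nonnegative floats, not all zero.
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    dim = len(samples[0])
    mean = [sum(w * x[j] for w, x in zip(weights, samples)) / total
            for j in range(dim)]
    cov = [[sum(w * (x[a] - mean[a]) * (x[b] - mean[b])
                for w, x in zip(weights, samples)) / total
            for b in range(dim)]
           for a in range(dim)]
    return mean, cov
```

Samples with zero weight (for instance, those below the threshold) simply drop out of both sums, and as the weights concentrate on a single point the covariance shrinks toward the zero matrix.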

The following five functions $\{H_i,\ i = 1, \ldots, 5\}$ are used to test the algorithm.

(1) Quadratic function

\[
H_1(x) = \sum_{i=1}^{3} x_i^2, \quad \text{where } x = (x_1, x_2, x_3)^T.
\]

The function has a unique global minimum $H_1(0, 0, 0) = 0$.

(2) Two-dimensional Rosenbrock function

\[
H_2(x) = 100(x_1^2 - x_2)^2 + (1 - x_1)^2, \quad \text{where } x = (x_1, x_2)^T.
\]

The function has the reputation of being difficult to minimize and is widely used to
test the performance of different optimization algorithms. It has a global minimum
$H_2(1, 1) = 0$.

(3) Shekel's Foxholes

\[
H_3(x) = \Bigg[0.002 + \sum_{j=1}^{25} \frac{1}{j + \sum_{i=1}^{2} (x_i - a_{j,i})^6}\Bigg]^{-1},
\]

where $a_{j,1}$ = {-32, -16, 0, 16, 32, -32, -16, 0, 16, 32, -32, -16, 0, 16, 32, -32, -16,
0, 16, 32, -32, -16, 0, 16, 32},
$a_{j,2}$ = {-32, -32, -32, -32, -32, -16, -16, -16, -16, -16, 0, 0, 0, 0, 0, 16, 16, 16, 16,
16, 32, 32, 32, 32, 32}, and $x = (x_1, x_2)^T$. The function has 24 local minima and one
global minimum $H_3(-32, -32) \approx 0.998004$.

(4) Corana's Parabola

\[
H_4(x) = \sum_{i=1}^{4}
\begin{cases}
0.15\,[0.05\,\mathrm{sgn}(z_i) - z_i]^2\, h_i & \text{if } |x_i - z_i| < 0.05, \\
h_i x_i^2 & \text{otherwise,}
\end{cases}
\]

where

\[
z_i = 0.2 \Big\lfloor \Big|\frac{x_i}{0.2}\Big| + 0.49999 \Big\rfloor \mathrm{sgn}(x_i),
\]

$h$ = {1, 1000, 10, 100}, and $x = (x_1, x_2, x_3, x_4)^T$. In the region $-1000 \le x_i \le 1000$, $i =
1, 2, 3, 4$, the above function has more than $10^{20}$ local minima, which makes it very difficult
to minimize. It has a global minimum $H_4(0, 0, 0, 0) = 0$.

(5) Goldstein-Price function

\[
\begin{aligned}
H_5(x) = {} & \big(1 + (x_1 + x_2 + 1)^2 (19 - 14x_1 + 3x_1^2 - 14x_2 + 6x_1 x_2 + 3x_2^2)\big) \\
& \times \big(30 + (2x_1 - 3x_2)^2 (18 - 32x_1 + 12x_1^2 + 48x_2 - 36x_1 x_2 + 27x_2^2)\big),
\end{aligned}
\]

where $x = (x_1, x_2)^T$. The function has four local minima and a global minimum
$H_5(0, -1) = 3$.
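
For readers who wish to reproduce the test problems, the following Python sketch implements four of the five functions in their standard textbook forms (the quadratic, Rosenbrock, Shekel's Foxholes, and Goldstein-Price functions) and can be used to check the stated minima. The function names are illustrative; Corana's parabola is omitted here for brevity.

```python
# Centers of the 25 "holes" of Shekel's Foxholes (standard 5 x 5 grid).
A1 = [-32, -16, 0, 16, 32] * 5
A2 = [a for a in (-32, -16, 0, 16, 32) for _ in range(5)]


def quadratic(x):
    """H1: sum of squares of the coordinates."""
    return sum(xi * xi for xi in x)


def rosenbrock(x):
    """H2: two-dimensional Rosenbrock function."""
    x1, x2 = x
    return 100.0 * (x1 ** 2 - x2) ** 2 + (1.0 - x1) ** 2


def shekel_foxholes(x):
    """H3: reciprocal of 0.002 plus a sum of 25 narrow peaks."""
    x1, x2 = x
    total = 0.002
    for j in range(25):
        total += 1.0 / ((j + 1) + (x1 - A1[j]) ** 6 + (x2 - A2[j]) ** 6)
    return 1.0 / total


def goldstein_price(x):
    """H5: Goldstein-Price function."""
    x1, x2 = x
    first = 1.0 + (x1 + x2 + 1.0) ** 2 * (
        19.0 - 14.0 * x1 + 3.0 * x1 ** 2 - 14.0 * x2 + 6.0 * x1 * x2 + 3.0 * x2 ** 2)
    second = 30.0 + (2.0 * x1 - 3.0 * x2) ** 2 * (
        18.0 - 32.0 * x1 + 12.0 * x1 ** 2 + 48.0 * x2 - 36.0 * x1 * x2 + 27.0 * x2 ** 2)
    return first * second
```

Evaluating Shekel's Foxholes at the other hole centers gives values that are larger than at $(-32, -32)$ but of comparable magnitude, which illustrates why the global minimum is hard to single out by sampling.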

For  all  five  problems,  the  same  set  of  parameters  is  used  to  test  MRAS:  e  =  10“^, 
initial  sample  size  Nq  =  100,  po  =  0.2,  A  =  0.02,  a  =  1.5,  r  =  0.1,  the  stopping  control 
parameters  d  =  5,  r  =  10“^,  =  50000,  Nmin  =  5n,  and  the  smoothing  parameter 

$v = 0.5$. The initial mean vector $\tilde\mu_0$ is an n-by-1 vector of all 10s, and $\tilde\Sigma_0$ is an n-by-n diagonal
matrix with all diagonal elements equal to 200, where recall that n is the dimension of the
problem.

Table 5.1 shows the performance of the algorithm on the five test functions. For
each function, we performed 50 independent replication runs of the algorithm, and the
means and standard errors are reported in the table, where $N_{total}$ is the total number of
function evaluations, $\rho_{final}$ is the final value of $\rho$, and $\bar H^*$ is the averaged value of the
function $H_i(\cdot)$ at the best solution visited by the algorithm. The optimal value $H_i(x^*)$ is
included for reference, and $M_\varepsilon$ indicates the number of runs out of 50 trials in which an
$\varepsilon$-optimal solution was found. The algorithm performs quite well in most cases, except
for $H_3$, where only 37 $\varepsilon$-optimal solutions were found. $H_3$ represents a class of continuous
optimization problems that are extremely difficult to solve for most model-based sampling
approaches. A graphical representation of the function is given in Figure 5.3. Notice
that the function values at the 25 “holes” (local minima) are very close to each other; thus
in order to locate the global optimal solution, the algorithm must make sure that samples
are drawn from the right “hole”, and enough samples must fall in this “hole”
to guarantee that the parameter vectors are updated in the right direction.

        N_total (std)          ρ_final (std)     H* (std)              H_i(x*)   M_ε
  H1    4.38e+03 (6.77e+01)    0.13 (6.21e-03)   9.86e-09 (1.12e-09)   0         50
  H2    1.21e+04 (4.89e+02)    0.04 (2.40e-03)   2.29e-09 (3.13e-10)   0         50
  H3    2.17e+04 (7.16e+02)    0.02 (1.11e-03)   2.40 (4.15e-01)       0.998     37
  H4    7.43e+03 (1.61e+02)    0.14 (4.19e-03)   0.00 (0.00e+00)       0         50
  H5    5.81e+03 (1.40e+02)    0.11 (5.88e-03)   3.00 (5.30e-10)       3         50

Table 5.1: Performance of MRAS on five test functions, based on 50 independent replication
runs. The standard errors are in parentheses.

Figure 5.3: Shekel's Foxholes, where $-50 \le x_i \le 50$, $i = 1, 2$.

For comparison purposes, we also applied the CE method to the above five test
functions, where we have used the multivariate normal p.d.f. with independent components
(cf. e.g., [51] for detailed algorithm description and implementation issues). We
have tested different sets of parameters (i.e., different $(N, \rho)$ combinations); the results
reported in Table 5.2 are based on the following “good” parameter settings: sample size
N = 1000 (recall that the CE method is non-adaptive, so the same number of samples
will  be  generated  at  each  iteration),  p  =  0.005,  smoothing  parameter  v  =  0.7,  and  the 
algorithm  is  stopped  either  when  there  exists  A:  >  0  such  that  maxi<j<5  |%  —  <  10“^ 

or  when  the  total  number  of  samples  generated  exceeds  2  x  10®,  where  %  is  the  sample 
(1  —  /9)-quantile  generated  at  the  fcth  iteration  of  CE. 

Again,  the  mean  vector  /xq  is  initialized  as  a  n-by-1  vector  of  all  10s  and  the  variances 
are  taken  to  be  a  n-by-1  vector  with  all  elements  equal  to  200. 


Ntotal  (std) 

H*  (std) 

Hi{x*) 

Me 

Hi 

1.69e-h04(1.73e-h02) 

4.94e-05(5.13e-06) 

0 

7 

H2 

1.72e-h04(1.18e-h02) 

1.92e-05(2.93e-06) 

0 

24 

Hs 

1.05e-h04(1.08e-h02) 

8.83(2.54e-01) 

0.998 

0 

Hi 

5.84e-h04(6.06e-h03) 

1.35e-03(4.21e-04) 

0 

38 

H5 

1.89e-h05(4.77e-h03) 

3.00(5.63e-05) 

3 

0 

Table  5.2:  Performance  of  the  standard  CE  method  on  five  test  functions,  based  on  50 
independent  runs.  The  standard  errors  are  in  parentheses. 
From Tables 5.1 and 5.2, we see that MRAS uses fewer samples than CE does,
yet produces more accurate solutions. In general, the sequence $\{\gamma_k\}$ generated by CE
may often converge quickly to a small neighborhood of $H(x^*)$; however, since no sample
performances are used in parameter updating (i.e., the top $100\rho\%$ of the samples are all
considered to be of the same importance regardless of their sample performances), the
future search will be biased toward the region that has been sampled most. In particular,
for the $H_3$ case, since the function values at different local minima are very close to each
other, even if the “hole” with the global minimum has been sampled during the search
process, CE still cannot distinguish the global minimum from the other local minima;
instead, CE will easily get stuck in the “hole” that has been sampled the most. As a result,
we see that the algorithm gets trapped in local minima in all 50 trials. In contrast, the
parameter updating procedure in MRAS is weighted by the performance function, so that
better samples have more influence on the updating process. Consequently, the
searches in MRAS will be biased toward the region containing more promising samples.
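
The contrast between equal and performance-based weighting of the elite samples can be sketched in a few lines. This is only an illustration under stated assumptions: the exponential weight `exp(-r * H)` is a stand-in chosen for simplicity and is not the exact weighting of the MRAS updating equations, and the function names are hypothetical.

```python
import math

def elite(samples, values, rho):
    """Return (sample, value) pairs in the best rho-fraction (minimization)."""
    ranked = sorted(zip(values, samples))
    n_elite = max(1, math.ceil(rho * len(ranked)))
    return [(x, h) for h, x in ranked[:n_elite]]

def ce_style_mean(samples, values, rho):
    """Unweighted update: every elite sample counts equally."""
    top = elite(samples, values, rho)
    return sum(x for x, _ in top) / len(top)

def weighted_mean(samples, values, rho, r=1.0):
    """Performance-weighted update (assumed weight exp(-r * H), illustrative only)."""
    top = elite(samples, values, rho)
    h_min = min(h for _, h in top)
    # Shift by the best value so the exponentials stay well scaled.
    weights = [math.exp(-r * (h - h_min)) for _, h in top]
    return sum(w * x for w, (x, _) in zip(weights, top)) / sum(weights)

# Hypothetical 1-D samples: the best point is x = 0, two mediocre elite points follow.
xs = [0.0, 1.0, 2.0, 10.0]
hs = [0.0, 5.0, 5.0, 100.0]
print(ce_style_mean(xs, hs, rho=0.75), weighted_mean(xs, hs, rho=0.75))
```

With equal weights the new mean sits in the middle of the elite set, whereas the weighted mean is pulled toward the best sample, which is the qualitative effect described in the paragraph above.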

Table 5.3 gives the performance of CE and MRAS on function $H_3$ using different
sample sizes and $\rho$ values (all other parameters are the same as before). The test results
indicate that increasing the sample size in CE has little effect on the quality of the
resulting solutions: the algorithm consistently gets stuck in local minima in
repeated experiments. On the other hand, for MRAS with $N_0 = 200$, $\varepsilon$-optimal solutions
were found in at least 90% of the simulation runs, whereas for the $N_0 \ge 500$ cases,
$\varepsilon$-optimal solutions were found in all 50 runs.

To illustrate the performance of the algorithm on high-dimensional problems, we
also applied MRAS$_1$ to the following benchmark problems, which have been previously
studied in, e.g., [24], [61], [88], and [51]. Function $H_6$ is a 4-dimensional problem that has
| Method | Parameters | $N_{total}$ (std) | $H_3^*$ (std) | $M_\varepsilon$ |
|--------|------------|-------------------|---------------|-----------------|
| CE | $N = 1000$, $\rho = 0.1$ | 1.47e+04 (1.39e+02) | 18.29 (0.18) | 0 |
| CE | $N = 1000$, $\rho = 0.01$ | 1.13e+04 (1.08e+02) | 11.90 (0.27) | 0 |
| CE | $N = 2000$, $\rho = 0.1$ | 2.91e+04 (2.07e+02) | 18.30 (0.09) | 0 |
| CE | $N = 2000$, $\rho = 0.01$ | 2.25e+04 (1.70e+02) | 12.27 (0.19) | 0 |
| CE | $N = 2000$, $\rho = 0.005$ | 2.14e+04 (2.25e+02) | 8.43 (0.21) | 0 |
| CE | $N = 5000$, $\rho = 0.1$ | 7.19e+04 (3.47e+02) | 18.30 (8.07e-11) | 0 |
| CE | $N = 5000$, $\rho = 0.01$ | 5.70e+04 (3.78e+02) | 12.52 (0.14) | 0 |
| CE | $N = 5000$, $\rho = 0.001$ | 4.87e+04 (6.67e+02) | 5.61 (0.32) | 0 |
| CE | $N = 10000$, $\rho = 0.1$ | 1.42e+05 (6.10e+02) | 18.30 (4.80e-11) | 0 |
| CE | $N = 10000$, $\rho = 0.01$ | 1.12e+05 (6.19e+02) | 12.67 (1.96e-12) | 0 |
| CE | $N = 10000$, $\rho = 0.001$ | 1.01e+05 (1.39e+03) | 4.80 (0.32) | 0 |
| MRAS$_1$ | $N_0 = 200$, $\rho_0 = 0.2$ | 2.27e+04 (6.77e+02) | 1.14 (0.06) | 45 |
| MRAS$_1$ | $N_0 = 200$, $\rho_0 = 0.1$ | 2.17e+04 (7.14e+02) | 1.08 (0.05) | 47 |
| MRAS$_1$ | $N_0 = 500$, $\rho_0 = 0.2$ | 3.01e+04 (6.67e+02) | 0.998 (3.41e-11) | 50 |
| MRAS$_1$ | $N_0 = 500$, $\rho_0 = 0.1$ | 2.76e+04 (8.70e+02) | 0.998 (3.92e-11) | 50 |
| MRAS$_1$ | $N_0 = 1000$, $\rho_0 = 0.2$ | 5.62e+04 (8.30e+02) | 0.998 (3.41e-11) | 50 |
| MRAS$_1$ | $N_0 = 1000$, $\rho_0 = 0.1$ | 4.31e+04 (8.46e+02) | 0.998 (3.81e-11) | 50 |

Table 5.3: Performance of CE and MRAS on test function $H_3$, based on 50 independent
simulation runs. The standard errors are in parentheses. The optimum $H_3(x^*) \approx 0.998004$.

only a few local optima; however, the minima are separated by plateaus and are relatively
far apart. Functions $H_7$ and $H_8$ are 20-dimensional badly scaled problems. Functions $H_9$
and $H_{10}$ are highly multimodal, and the number of local optima increases exponentially
with the problem dimension. Function $H_{11}$ is both badly scaled and highly multimodal.
The graphical representations of some of these functions in two dimensions are plotted in
Figure 5.4.


(6) Shekel’s function

$$H_6(x) = -\sum_{i=1}^{5} \left( (x - a_i)^T (x - a_i) + c_i \right)^{-1},$$

where $x = (x_1, x_2, x_3, x_4)^T$, $a_1 = (4, 4, 4, 4)^T$, $a_2 = (1, 1, 1, 1)^T$, $a_3 = (8, 8, 8, 8)^T$,
$a_4 = (6, 6, 6, 6)^T$, $a_5 = (3, 7, 3, 7)^T$, and $c = (0.1, 0.2, 0.2, 0.4, 0.4)$. The global
minimizer is $x^* \approx (4, 4, 4, 4)^T$, and $H_6(x^*) \approx -10.153$.

(7) Rosenbrock function

$$H_7(x) = \sum_{i=1}^{n-1} \left[ 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2 \right],$$

where $n = 20$. The global minimizer is $x^* = (1, \ldots, 1)^T$, and $H_7(x^*) = 0$.

(8) Powell singular function

$$H_8(x) = \sum_{i=2}^{n-2} \left[ (x_{i-1} + 10 x_i)^2 + 5 (x_{i+1} - x_{i+2})^2 + (x_i - 2 x_{i+1})^4 + 10 (x_{i-1} - x_{i+2})^4 \right],$$

where $n = 20$, $x^* = (0, \ldots, 0)^T$, and $H_8(x^*) = 0$.

(9) Trigonometric function

$$H_9(x) = 1 + \sum_{i=1}^{n} \left[ 8 \sin^2\!\left(7 (x_i - 0.9)^2\right) + 6 \sin^2\!\left(14 (x_i - 0.9)^2\right) + (x_i - 0.9)^2 \right],$$

where $n = 20$, $x^* = (0.9, \ldots, 0.9)^T$, and $H_9(x^*) = 1$.


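(10) Griewank function

$$H_{10}(x) = \frac{1}{4000} \sum_{i=1}^{n} x_i^2 - \prod_{i=1}^{n} \cos\!\left(\frac{x_i}{\sqrt{i}}\right) + 1,$$

where $n = 20$, $x^* = (0, \ldots, 0)^T$, and $H_{10}(x^*) = 0$.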
(11) Pintér’s function

$$H_{11}(x) = \sum_{i=1}^{n} i x_i^2 + \sum_{i=1}^{n} 20 i \sin^2\!\left(x_{i-1} \sin x_i - x_i + \sin x_{i+1}\right) + \sum_{i=1}^{n} i \log_{10}\!\left(1 + i \left(x_{i-1}^2 - 2 x_i + 3 x_{i+1} - \cos x_i + 1\right)^2\right),$$

where $x_0 = x_n$, $x_{n+1} = x_1$, $n = 20$, $x^* = (0, \ldots, 0)^T$, and $H_{11}(x^*) = 0$.
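
As a quick sanity check on the definitions above, three of the benchmark functions can be coded directly and evaluated at their stated minimizers. The sketch below assumes the standard textbook forms of the Rosenbrock, trigonometric, and Griewank functions; the Python function names are illustrative and not part of the original study.

```python
import math

def rosenbrock(x):
    """H_7: sum over i = 1..n-1 of 100 (x_{i+1} - x_i^2)^2 + (x_i - 1)^2."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))

def trigonometric(x):
    """H_9: 1 + sum of 8 sin^2(7 d) + 6 sin^2(14 d) + d, with d = (x_i - 0.9)^2."""
    total = 1.0
    for xi in x:
        d = (xi - 0.9) ** 2
        total += 8.0 * math.sin(7.0 * d) ** 2 + 6.0 * math.sin(14.0 * d) ** 2 + d
    return total

def griewank(x):
    """H_10: (1/4000) sum x_i^2 - prod cos(x_i / sqrt(i)) + 1 (standard form)."""
    s = sum(xi ** 2 for xi in x) / 4000.0
    p = math.prod(math.cos(xi / math.sqrt(i)) for i, xi in enumerate(x, start=1))
    return s - p + 1.0

n = 20
print(rosenbrock([1.0] * n), trigonometric([0.9] * n), griewank([0.0] * n))
```

Each function returns its stated optimal value at the corresponding $x^*$, which gives a simple way to catch transcription errors before running any optimizer on them.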


Figure 5.4: Selected test problems in two dimensions. (a) $H_6$: Shekel; (b) $H_7$: Rosenbrock;
(c) $H_9$: Trigonometric; (d) $H_{11}$: Pintér.


For all problems $H_6$–$H_{11}$, the same set of parameters is used to test MRAS$_1$: $\varepsilon$ =
10“®, initial sample size $N_0 = 1000$, $\rho_0 = 0.1$, $\lambda = 0.01$, $\alpha = 1.1$, $r$ = 10“^, smoothing
parameter $v = 0.2$, and $N_{\min} = 5n$. The initial mean vector $\mu_0$ is an $n$-by-1 vector with
each component randomly selected from the interval $[-50, 50]$ according to the uniform
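| Test prob. | MRAS$_1$: $H_i^*$ (stderr) | MRAS$_1$: $M_\varepsilon$ | CE ($v = 0.7$): $H_i^*$ (stderr) | CE ($v = 0.7$): $M_\varepsilon$ | CE ($v = 0.2$): $H_i^*$ (stderr) | CE ($v = 0.2$): $M_\varepsilon$ | SA: $H_i^*$ (stderr) | SA: $M_\varepsilon$ |
|------------|------|------|------|------|------|------|------|------|
| $H_6$ | -10.15 (3e-7) | 50 | -8.0 (0.5) | 34 | -9.9 (0.13) | 0 | -7.3 (0.4) | 2 |
| $H_7$ | 11.8 (0.5) | 0 | 27.9 (3.43) | 0 | 15.9 (2e-02) | 0 | 203.7 (11.3) | 0 |
| $H_8$ | 3e-10 (2e-11) | 50 | 1e+4 (4e+3) | 3 | 3e-6 (2e-7) | 50 | 65.9 (3.0) | 0 |
| $H_9$ | 1.6 (0.13) | 24 | 1.0 (00e-00) | 50 | 1.0 (6e-12) | 50 | 65.2 (1.22) | 0 |
| $H_{10}$ | 4e-3 (7e-4) | 28 | 2e-4 (2e-4) | 49 | 2e-12 (4e-13) | 50 | 0.15 (0.04) | 0 |
| $H_{11}$ | 3e-9 (6e-10) | 50 | 2.3 (1e-3) | 0 | 6e-4 (3e-05) | 0 | 1.7e+3 (51) | 0 |

Table 5.4: Performance of different algorithms on benchmark problems $H_6$–$H_{11}$, based
on 50 independent runs. The standard errors are in parentheses.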

distribution, and $\Sigma_0$ is an $n$-by-$n$ diagonal matrix with all diagonal elements equal to 500.

For comparison purposes, we also applied the CE method and the SA algorithm to
the above test functions. For CE, we have used the univariate normal p.d.f. with parameter
values suggested in [51]: sample size $N = 2000$, $\rho = 0.01$, smoothing parameter $v = 0.7$.
Again, the initial mean vector $\mu_0$ is randomly selected from $[-50, 50]^n$ according to the
uniform distribution, and $\Sigma_0$ is an $n$-by-$n$ diagonal matrix with all diagonal elements equal to 500.
We found empirically that the above parameters work well for some functions, but in some
other cases, the variance matrices in CE may converge too quickly to the zero matrix, which
freezes the algorithm at some low-quality solutions. To address this issue, for each problem,
we also tried CE with different values of the smoothing parameter. In the numerical results
reported below, we have also used a smaller smoothing parameter value $v = 0.2$, which gives
reasonable performance for all test cases. For SA, we have used the parameters suggested
in [24]: initial temperature $T = 50000$, temperature reduction factor $r_T = 0.85$, the search
Figure 5.5: Average performance (mean of 50 replications) of MRAS, CE, and SA on
selected benchmark problems, including (a) $H_6$: Shekel’s function and (b) $H_7$: 20-D Rosenbrock.
neighborhood of a point $x$ is taken to be $\mathcal{N}(x) = \{y : \max_{1 \le i \le n} |x_i - y_i| \le 1\}$, and the
initial solution is uniformly selected from $[-50, 50]^n$.

For each problem, we performed 50 independent runs of all three algorithms, and the
numerical results are reported in Table 5.4, where $H_i^*$ is the averaged value of the function
$H_i(\cdot)$ at the best solution visited by the algorithm, with the standard error in parentheses,
and $M_\varepsilon$ indicates the number of runs, out of 50 trials, in which an $\varepsilon$-optimal solution
was found. We also plotted in Figure 5.5 the average function values of the current best
solution against the number of samples generated for selected benchmark problems. The
performance comparison is based on the same amount of computational effort, where
for each algorithm, the total number of function evaluations (i.e., sample size) is set to
100,000 for $H_6$, and 400,000 for $H_7$–$H_{11}$. Here, we choose to use the total number of
function evaluations to estimate the computational effort of the different algorithms, because
the running time of all three algorithms is dominated by the time spent in evaluating the
objective function.

Function $H_6$ has only a few local minima, and since SA incorporates local search, it
may quickly locate one of them. However, as we can see, SA stops making improvement
during the early search phase. This is caused by the plateaus surrounding the local
minima, which make it very difficult for SA to escape local optima. In contrast, since
both MRAS$_1$ and CE are population-based, they show more robustness in dealing with
local optima. We see that CE ($v = 0.7$) does not always converge to the global optimal
solution, but it still performs better than SA does. Note that decreasing the value of the
smoothing parameter slows down the convergence of CE. In particular, for the $v = 0.2$
case, although better average function values are achieved in CE, no $\varepsilon$-optimal solutions
were found within the allowed simulation budget because of the slow convergence. MRAS$_1$
consistently finds $\varepsilon$-optimal solutions in all simulation runs.


For $H_7$, none of the three algorithms found $\varepsilon$-optimal solutions. However,
Figure 5.5(b) indicates that both MRAS$_1$ and CE perform better than SA when the total
sample size is large enough. CE with $v = 0.2$ converges slowly, but slightly outperforms
CE ($v = 0.7$) after about 170,000 function evaluations. MRAS$_1$ performs the best: it
has a convergence rate similar to that of CE ($v = 0.7$) and finds better solutions than the other
algorithms do. On $H_8$, MRAS$_1$ is clearly superior to both CE and SA. It converges to the
global optimal solution in all 50 runs at an exponential rate. The performance of SA is
similar to the $H_7$ case, whereas the performance of CE ($v = 0.7$) is even worse than that
of SA; as we can see, the algorithm frequently gets trapped at solutions that are far from
optimal. CE with $v = 0.2$ yields much better performance.

$H_9$ and $H_{10}$ are highly multimodal functions. CE ($v = 0.7$) works better than both
MRAS$_1$ and SA: it not only converges the fastest but also finds $\varepsilon$-optimal solutions in
almost all runs. SA finds no $\varepsilon$-optimal solutions in any of the runs. MRAS$_1$ consistently
outperforms SA, and converges to the optimal solution in about 50% of the simulation runs
in both cases. Initially, MRAS$_1$ converges very fast to good values near the optimum;
it then proceeds at a slower rate and spends most of the time fine-tuning the solution.
The behavior of MRAS$_1$ can be explained by looking at the parameter updating equations
(5.16) and (5.17). Since the values of $H_9$ and $H_{10}$ at local minima near the optimum are
very close to each other, the parameter updating in MRAS$_1$ is dominated by the density
function in the denominator, especially when the iteration counter $k$ is small.

$H_{11}$ contains both a badly scaled quadratic term and some badly scaled noise terms.
For this function, SA does not seem to be competitive at all. Similar to the $H_7$ and $H_8$
cases, CE ($v = 0.7$) converges the fastest, but stagnates at some non-optimal solutions in
all runs. Using $v = 0.2$ in CE greatly improves the solution quality but slows down the
convergence speed. The initial behavior of MRAS$_1$ is similar to the $H_9$ and $H_{10}$ cases, but
the algorithm outperforms CE ($v = 0.7$) after about 170,000 function evaluations, and
then approaches the optimum at an exponential rate.

The above comparison seems to suggest that MRAS$_1$ is better adapted to optimization
of badly scaled multimodal problems, whereas CE works best on problems that are
well scaled and contain a large number of local optima. Of course, a more comprehensive
numerical study needs to be carried out in order to confirm this finding.


5.6.2  Combinatorial  Optimization 


In this section, we present the performance of MRAS on various asymmetric traveling
salesman problem (ATSP) instances. All test cases are taken from the URL

http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95.

For each ATSP problem with $N_c$ cities, an $N_c$-by-$N_c$ distance matrix $G$ is given,
whose $(i, j)$th element $G_{i,j}$ represents the distance from city $i$ to city $j$. The goal is to find
the shortest tour that visits all the cities and returns to the starting city. Mathematically,
the problem can be formulated as follows:


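$$\min_{x \in \mathcal{X}} H(x) := \min_{x \in \mathcal{X}} \left\{ \sum_{i=1}^{N_c - 1} G_{x_i, x_{i+1}} + G_{x_{N_c}, x_1} \right\}, \tag{5.33}$$

where $x := (x_1, x_2, \ldots, x_{N_c}, x_1)$ is an admissible tour, and $\mathcal{X}$ is the set of all admissible
tours.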
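
The objective in (5.33) can be evaluated with a short helper. This is a minimal sketch, assuming cities are indexed from 0 and the tour is stored without repeating the starting city; the function name and the small distance matrix are hypothetical.

```python
def tour_length(tour, G):
    """Length of the closed tour: consecutive legs plus the return leg to the start."""
    n = len(tour)
    return sum(G[tour[i]][tour[(i + 1) % n]] for i in range(n))

# Hypothetical asymmetric 3-city instance: G[i][j] != G[j][i] in general.
G = [[0, 1, 5],
     [2, 0, 1],
     [1, 4, 0]]
print(tour_length([0, 1, 2], G), tour_length([0, 2, 1], G))
```

Because the distance matrix is asymmetric, traversing the same cycle in the opposite direction generally gives a different length, which is what distinguishes the ATSP from the symmetric case.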


We use the same technique as in [67] and [26] for solving these problems, i.e., we
associate with each distance matrix $G$ an initial state transition matrix $P_0$, whose $(i, j)$th
element specifies the probability of transitioning from city $i$ to city $j$. Thus, at each
iteration of MRAS the following two steps are fundamental:

•  Generate random (admissible) tours according to the transition matrix and evaluate
the performance of each sample tour.

•  Update the transition matrix based on the sample tours generated in the previous
step.
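
One common way to carry out the first step is to walk through the cities, masking out those already visited and renormalizing the remaining transition probabilities. The sketch below follows that scheme as an assumption; the details of the procedure actually used are given in [26], and the function name is hypothetical.

```python
import random

def sample_tour(P, rng, start=0):
    """Draw one admissible tour from transition matrix P (assumed masking scheme)."""
    n = len(P)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        current = tour[-1]
        candidates = sorted(unvisited)
        weights = [P[current][j] for j in candidates]
        if sum(weights) <= 0.0:
            # Degenerate row: no mass left on unvisited cities, fall back to uniform.
            weights = [1.0] * len(candidates)
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

rng = random.Random(0)
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
print(sample_tour(P, rng))
```

Every tour produced this way visits each city exactly once, so it is admissible by construction; the return leg to the starting city is implied.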


The detailed discussion of how to generate admissible tours can be found in, e.g., [26]. We
now briefly address the issue of how to update the transition matrix. At each iteration
$k$ of MRAS, the p.d.f. $f(\cdot, P_k)$ on $\mathcal{X}$ is parameterized by the transition matrix $P_k$ and is
given by

$$f(x, P_k) = \prod_{l=1}^{N_c} \sum_{i,j=1}^{N_c} P_k(i, j)\, I\{x \in \mathcal{X}_{i,j}(l)\},$$

where $\mathcal{X}_{i,j}(l)$ is the set of all tours in $\mathcal{X}$ such that the $l$th transition is from city $i$ to
city $j$. It is straightforward to show that the new transition matrix $P_{k+1}$ is updated in
equation (5.21) as


Pk+i{i,j)  = 


(5.34) 


where the tours appearing in (5.34) are the i.i.d. sample tours generated from $f(\cdot, P_k)$,
$\gamma_{k+1}$ is defined as in equation (5.22), and $\mathcal{X}_{i,j}$ represents the set of tours in which the
transition from city $i$ to city $j$ is made.

The performance of the algorithm on various ATSP problems is reported in Table 5.5.
For each of the 7 instances, we performed 10 independent runs of the algorithm.
In Table 5.5, $N_{total}$ is the total number of tours generated (mean and standard error
reported), $H_{best}$ is the length of the shortest path, $H^*$ and $H_*$ are the worst and best
solutions obtained out of 10 trials, $\delta^*$ and $\delta_*$ are the respective relative errors for $H^*$ and
$H_*$, and $\delta$ is the relative error (mean and standard error reported). For all cases, $\varepsilon = 1$,
the initial sample size $N_0 = 1000$, $\rho_0 = 0.1$, $\lambda = 0.02$, $\alpha = 1.5$, $r = 0.1$, the stopping control
parameters $d = 5$, $\tau = 0$, $N_{\max} = 10 N_c^2$, smoothing parameter $v = 0.5$, and the initial
transition matrix $P_0$ is initialized as a stochastic matrix whose $(i, j)$th entry is proportional
to the inverse of the $(i, j)$th entry of $G$, i.e., $P_0(i, j) \propto 1 / G_{i,j}$ and
$\sum_{j=1}^{N_c} P_0(i, j) = 1$ $\forall\, i$.
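
The initialization of $P_0$ can be sketched as follows. The sketch assumes strictly positive off-diagonal distances and sets self-transitions to zero; the latter is an assumption made here for illustration, and the function name is hypothetical.

```python
def initial_transition_matrix(G):
    """Row-stochastic P0 with P0[i][j] proportional to 1 / G[i][j] for j != i."""
    n = len(G)
    P0 = []
    for i in range(n):
        # Assumption: no self-transitions, so the diagonal weight is zero.
        inv = [0.0 if j == i else 1.0 / G[i][j] for j in range(n)]
        total = sum(inv)
        P0.append([w / total for w in inv])
    return P0

# Hypothetical 3-city distance matrix.
G = [[0, 1, 2],
     [1, 0, 4],
     [2, 4, 0]]
for row in initial_transition_matrix(G):
    print(row)
```

Starting from inverse distances gives nearby cities a higher initial transition probability, so the first batch of sample tours is already biased toward short legs.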

| ATSP | $N_c$ | $N_{total}$ (std err) | $H_{best}$ | $H^*$ | $H_*$ | $\delta^*$ | $\delta_*$ | $\delta$ (std err) |
|------|-------|-----------------------|------------|-------|-------|------------|------------|--------------------|
| ftv33 | 34 | 7.95e+4 (3.25e+3) | 1286 | 1364 | 1286 | 0.061 | 0.000 | 0.023 (0.008) |
| ftv35 | 36 | 1.02e+5 (3.08e+3) | 1473 | 1500 | 1475 | 0.018 | 0.001 | 0.008 (0.002) |
| ftv38 | 39 | 1.31e+5 (4.90e+3) | 1530 | 1563 | 1530 | 0.022 | 0.000 | 0.008 (0.003) |
| p43 | 43 | 1.02e+5 (4.67e+3) | 5620 | 5637 | 5620 | 0.003 | 0.000 | 0.001 (2.5e-4) |
| ry48p | 48 | 2.62e+5 (1.59e+4) | 14422 | 14810 | 14446 | 0.027 | 0.002 | 0.012 (0.003) |
| ft53 | 53 | 2.94e+5 (1.58e+4) | 6905 | 7236 | 6973 | 0.048 | 0.010 | 0.029 (0.005) |
| ft70 | 70 | 4.73e+5 (2.91e+4) | 38673 | 39751 | 38744 | 0.028 | 0.002 | 0.017 (0.003) |

Table 5.5: Performance of MRAS on various ATSP problems based on 10 independent
replications. The standard errors are in parentheses.
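
The relative errors in Table 5.5 appear to be computed with respect to the reference length $H_{best}$. The following check assumes the usual definition $(H - H_{best}) / H_{best}$, which is not stated explicitly in the text but reproduces the tabulated values for the rows tried here.

```python
def relative_error(H, H_best):
    """Relative gap of tour length H with respect to the reference length H_best."""
    return (H - H_best) / H_best

# Worst and best tours for ftv33 and ftv35, taken from Table 5.5.
print(round(relative_error(1364, 1286), 3), round(relative_error(1286, 1286), 3))
print(round(relative_error(1500, 1473), 3), round(relative_error(1475, 1473), 3))
```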

5.7  Conclusions 

In this chapter, we have proposed a randomized search technique called Model
Reference Adaptive Search (MRAS) for solving general global optimization problems. The
method iteratively updates a parameterized probability distribution over the solution space
so that the sequence of candidate solutions generated from this distribution will converge
asymptotically to the global optimum. We have provided a particular instantiation of
the framework and established its global convergence properties in both continuous and
discrete (combinatorial) domains. In addition, we have explored the relationship between
the recently proposed Cross-Entropy (CE) method and MRAS, and showed that the CE
method can also be interpreted as an instance of the MRAS framework. Finally, we
have also carried out detailed numerical experiments to investigate the performance of the
method.

Throughout this chapter, most of the theoretical and empirical analysis
has focused on one instantiation of the framework. However, we emphasize that the
contribution of this research goes beyond this particular instantiation, in that it provides
a general framework for designing and analyzing various model-based optimization
algorithms. In MRAS, the task of sampling candidate solutions and the task of updating
probabilistic models are split in a natural way, and there is considerable flexibility in the
choice of reference distributions. Thus, by carefully selecting the reference distributions,
one can construct different instantiations of the framework. Moreover, the convergence
analysis of these instantiations reduces to the study of the properties
of the reference distributions.

The MRAS$_1$ algorithm demonstrated great promise on some preliminary examples,
but practical implementation issues remain. For example, selection of the input parameters
in our numerical experiments was based mainly on trial and error. How to determine
a priori the most appropriate values of these parameters for a given problem is an open issue.
Designing an adaptive scheme to update these parameters during the search process may
also enhance the convergence rate of the algorithm.

A  more  important  line  of  research  is  to  extend  the  MRAS  method  to  stochastic 
optimization  problems,  where  the  function  values  can  only  be  observed  in  the  presence  of 
noise.  The  construction  of  a  practically  efficient  generalization  of  MRAS  with  provable 
convergence  is  addressed  in  Chapter  6. 
Chapter 6

A Model Reference Adaptive Search Method for Stochastic Global Optimization

6.1 Introduction and Motivation

In Chapter 5, we proposed a unifying framework called Model Reference Adaptive
Search (MRAS) for solving deterministic global optimization problems. In this chapter,
we discuss how to extend the framework to stochastic optimization problems.
Stochastic problems arise in a wide range of areas such as manufacturing, communication
networks, system design, and financial engineering. In contrast to their deterministic
counterparts, such problems are typically much more difficult to solve, either because an
explicit relation between the objective function and the underlying decision variables is
unavailable or because the cost of a precise evaluation of the objective function is
prohibitive. Oftentimes, one has to use simulation or real-time observations to evaluate
the objective function. In such situations, all the objective function evaluations will contain
some noise, so special techniques are generally used (as opposed to the deterministic
optimization methods) in order to filter out the noisy components.

There are two major techniques to address the function evaluation noise arising
from the stochastic setting. One simple approach is to spend a significant amount of
computational effort at each point the algorithm visits in order to obtain a precise estimate
of the objective function value, and then use a deterministic optimization approach to solve
the underlying problem. In this respect, the extension of MRAS to stochastic settings
should be relatively straightforward. However, questions arise as to what really qualifies as a
precise estimate, how much computational effort should be invested at each point, and how
the estimates of the objective function values will affect the final solutions of the algorithm.
These questions are not easy to answer; moreover, when function evaluations are
expensive, obtaining a precise estimate of the objective function value is often infeasible.
To circumvent these difficulties, we resort to the alternative approach, which does not
require obtaining highly precise estimates of the objective function values each time the
algorithm visits a solution. However, we need to modify the MRAS approach intended for
deterministic problems in order to yield good performance in the presence of noise.

The method we propose in this chapter is called stochastic model reference adaptive
search (SMRAS), which is essentially a generalization of the MRAS method for deterministic
optimization with some appropriate modifications and extensions required for the
stochastic setting. The idea behind SMRAS, as in MRAS for deterministic optimization,
is to use a pre-specified parameterized probability distribution family to generate candidate
solutions, and to use a sequence of convergent reference distributions to facilitate and
guide the updating of the parameters associated with the parameterized family at each
step of the iteration procedure. A major modification from the original MRAS method is
in the way the sequence of reference distributions is constructed. In MRAS, reference
distributions are idealized probabilistic models constructed based on the exact performance
of the candidate solutions. In the stochastic case, however, the objective function cannot
be evaluated deterministically, so the sample average approximations of the (idealized)
reference distributions are used in SMRAS to guide the parameter updating. We show
that for a class of parameterized distributions, i.e., the so-called Natural Exponential Family
(NEF), SMRAS converges with probability one to a global optimal solution for both
stochastic continuous and discrete problems. To the best of our knowledge, SMRAS is
the first model-based search method for solving general stochastic optimization problems
with provable convergence.

6.2  A  Brief  Review  of  Stochastic  Optimization  Solution  Techniques 

There  are  some  obvious  distinctions  between  the  solution  techniques  for  stochastic 
optimization  when  the  decision  variable  is  continuous  and  when  it  is  discrete.  Although 
some  techniques,  in  principle,  can  be  applied  to  both  types  of  problems,  they  require  some 
suitable  modifications  in  order  to  switch  from  one  setting  to  another. 

A  well-known  class  of  methods  for  solving  stochastic  optimization  problems  with 
continuous  decision  variables  is  stochastic  approximation  (SA).  These  methods  mimic 
the  classical  gradient-based  search  method  in  deterministic  optimization,  and  rely  on  the 
estimation  of  the  gradient  of  the  objective  function  with  respect  to  the  decision  variables. 
Because  they  are  gradient-based,  these  methods  generally  find  local  optimal  solutions.  In 
terms  of  the  different  gradient  estimation  techniques  employed,  the  SA  algorithms  can 
be  generally  divided  into  two  categories:  algorithms  that  are  based  on  direct  gradient 
estimation  techniques,  the  best-known  of  which  are  perturbation  analysis  (PA)  and  the 
likelihood  ratio/score  function  (LR/SF)  method  ([69]),  and  algorithms  that  are  based  on 
indirect  gradient  estimation  techniques  like  finite  difference  and  its  variations  ([78]).  A 
detailed  review  of  various  gradient  estimation  techniques  can  be  found  in  [32] . 

When the underlying decision variables are discrete, one popular approach is to use
random search. This has given rise to many different stochastic discrete optimization
algorithms, including the stochastic ruler method and its modification ([4], [87]), the random
search methods ([6], [7]), modified simulated annealing ([5]), and the nested partitions
method of [76]. The main idea throughout is to construct a Markov chain over the solution
space and show that the Markov chain settles down on the set of (possibly local)
optimal solutions.

From an algorithmic point of view (cf. Section 1.2), the aforementioned approaches
are instance-based techniques. There are also some independently developed model-based
methods that can be applied to stochastic discrete optimization problems. The two most
well-established methods are Stochastic Ant Colony Optimization (S-ACO) ([37]) and
the Cross-Entropy (CE) method (cf., e.g., [26], [65], [66], [67], [68]). The S-ACO method is
the extension of the original Ant Colony Optimization (ACO) algorithm ([29]) to stochastic
problems. The method uses Monte Carlo sampling to estimate the objective and is shown
(under some regularity assumptions) to converge with probability one to the global optimal
solution for stochastic combinatorial problems. The CE method was motivated
by an adaptive algorithm for estimating probabilities of rare events. It was later realized
that the method can be modified to solve deterministic optimization problems (cf., e.g.,
[66]). More recently, Rubinstein [67] showed that the method is also capable of handling
stochastic network combinatorial optimization problems, and in that context, established
the probability one convergence of the algorithm.

6.3  The  Stochastic  Model  Reference  Adaptive  Search  Method 

We consider the following optimization problem:

$$x^* \in \arg\max_{x \in \mathcal{X}} E_\psi[H(x, \psi)], \qquad \mathcal{X} \subseteq \Re^n, \tag{6.1}$$

where $\mathcal{X}$ is the solution space, which can be either continuous or discrete, $H(\cdot, \cdot)$ is a
deterministic, real-valued function, and $\psi$ is a random variable (possibly depending on $x$)
representing the stochastic effects of the system. We let $h(x) := E_\psi[H(x, \psi)]$, and assume
that $h(x)$ cannot be obtained easily, but the random variable $H(x, \psi)$ can be observed, e.g.,
via simulation. We assume throughout that (6.1) has a unique global optimal solution,
i.e., $\exists\, x^* \in \mathcal{X}$ such that $h(x) < h(x^*)$ $\forall\, x \ne x^*$, $x \in \mathcal{X}$. We also assume that random
sampling can be done easily on $\mathcal{X}$, at least for a class of distributions of interest.
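
To make the setting concrete, the sketch below estimates $h(x)$ by averaging independent noisy observations of $H(x, \psi)$. The objective $H(x, \psi) = -(x - 2)^2 + \psi$ with standard normal $\psi$ is a hypothetical example chosen for illustration, and the function names are not from the original text.

```python
import random

def noisy_H(x, rng):
    """Hypothetical noisy observation H(x, psi) = -(x - 2)^2 + psi, psi ~ N(0, 1)."""
    return -(x - 2.0) ** 2 + rng.gauss(0.0, 1.0)

def sample_average(x, m, rng):
    """Estimate h(x) = E[H(x, psi)] by averaging m independent observations."""
    return sum(noisy_H(x, rng) for _ in range(m)) / m

rng = random.Random(1)
for m in (10, 1000, 100000):
    print(m, sample_average(2.0, m, rng))
```

In this example $h(x) = -(x - 2)^2$, so the estimates at $x = 2$ approach 0 as the number of observations grows, with the error shrinking at the usual $1/\sqrt{m}$ rate.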

6.3.1  General  Framework 

Similar to MRAS, SMRAS uses a family of parameterized distributions $\{f(\cdot, \theta), \theta \in \Theta\}$
as sampling distributions to generate candidate solutions, where $\Theta$ is some parameter
space. The basic algorithmic structure is very simple. At each iteration $k$, suppose we
have already obtained a parameter $\theta_k$; then the main body of the method consists of the
following two steps:

1.  Generate candidate solutions from the current sampling distribution $f(\cdot, \theta_k)$.

2.  Compute a new parameter vector $\theta_{k+1}$ according to a specified parameter updating
rule by using the candidate solutions generated in the previous step in order to
concentrate the future search toward more promising regions.

The parameter updating rule in SMRAS is guided by another sequence of distributions
$\{g_k(\cdot)\}$, called the reference distributions. These reference distributions are used
to express the desired properties of the method; thus, we may often want to construct
them so that they have some nice theoretical properties (however, they could be
difficult to handle in practice). Once these reference distributions are specified, at
each iteration $k$ we look at the projection of $g_k(\cdot)$ on the family of distributions $\{f(\cdot, \theta)\}$
and compute the new parameter vector $\theta_{k+1}$ that minimizes the Kullback-Leibler (KL)
distance

$$\mathcal{D}(g_k, f(\cdot, \theta)) := E_{g_k}\left[ \ln \frac{g_k(X)}{f(X, \theta)} \right],$$

where $X = (X_1, \ldots, X_n)$ is a random vector having distribution $g_k(\cdot)$ and taking values in
$\mathcal{X}$, and $E_{g_k}[\cdot]$ represents the expectation taken with respect to $g_k(\cdot)$. Intuitively speaking,
under the KL-distance measure, $f(\cdot, \theta_{k+1})$ can be viewed as a compact approximation of
the reference distribution $g_k(\cdot)$ and thus may share some similar properties with $g_k(\cdot)$.
Therefore, to ensure the convergence of SMRAS, one basic property the sequence $\{g_k(\cdot)\}$ should
have is convergence. There could be many different ways to construct such a convergent
sequence of distributions. When the performance measure is deterministic, we have
proposed in Section 5.2 the following simple iterative method for constructing the reference
distributions $\{g_k(\cdot)\}$. Let $g_0(x) > 0$, $\forall\, x \in \mathcal{X}$, be an initial probability density/mass
function (p.d.f./p.m.f.) on the solution space $\mathcal{X}$. Then, at each iteration $k \ge 1$, compute a
new p.d.f./p.m.f. by tilting the old p.d.f./p.m.f. $g_{k-1}(x)$ with the performance function
$h(x)$ (for simplicity, here we assume that $h(x) > 0$, $\forall\, x \in \mathcal{X}$), i.e.,


h{x)gk-i{x)  v7  ^  V 
9k{x)  =  ,  - 7^,  Vx  G  A. 


(6.2) 


h{x)gk-i{dx) 

It  is  possible  to  show  that  the  sequence  of  p.d.f.’s  {gk{-)}  constructed  above  will  converge 
to  a  p.d.f.  that  concentrates  only  on  the  set  of  optimal  solutions,  regardless  of  the  ini¬ 
tial  go{-)  used.  However,  in  the  stochastic  setting,  since  the  performance  function  h{-) 
cannot  be  evaluated  exactly,  the  iteration  procedure  given  by  (6.2)  is  no  longer  applica¬ 
ble.  Thus,  in  SMRAS,  one  key  modification  from  the  original  deterministic  approach  is 
to  use  approximations  {gk{')}  of  {5'fc(-)}  as  the  sequence  of  reference  distributions,  which 
are  constructed  based  on  the  sample  average  approximation  of  the  expected  performance 
function  h(-). 
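As an illustrative sketch (not part of the original development), the tilting iteration (6.2) can be run on a small finite solution space. The performance values `h` and the uniform starting p.m.f. are hypothetical; the code only shows that the mass of $g_k$ accumulates on the maximizer and that $E_{g_k}[h(X)]$ does not decrease.

```python
import numpy as np

def tilt(g_prev, h):
    """One step of (6.2) on a finite space: g_k is proportional to h * g_{k-1}."""
    w = h * g_prev
    return w / w.sum()

h = np.array([1.0, 2.0, 4.0, 3.0])   # hypothetical performance values, all positive
g = np.full(4, 0.25)                 # g_0: uniform initial p.m.f.
means = []
for k in range(1, 61):
    g = tilt(g, h)
    means.append(float(g @ h))       # E_{g_k}[h(X)]
```

After 60 iterations almost all of the probability mass sits on the third point, the one with the largest `h`.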


6.3.2  Algorithm  Description 

In SMRAS, there are two allocation rules. The first one, denoted by $\{N_k,\ k = 0,1,\ldots\}$,
is called the sampling allocation rule, where each $N_k$ determines the number of
candidate solutions to be generated from the current sampling distribution at the $k$th
iteration. The second is the observation allocation rule $\{M_k,\ k = 0,1,\ldots\}$, which allocates
$M_k$ simulation observations to each of the candidate solutions generated at the $k$th iteration.
We require both $N_k$ and $M_k$ to increase as the number of iterations grows for convergence,
but other than that, there is considerable flexibility in their choices. To fix ideas, we use
a parameter $\alpha > 1$, specified initially, to control the rate of increase in $\{N_k,\ k = 0,1,\ldots\}$,
and leave the sequence $\{M_k,\ k = 0,1,\ldots\}$ as user-specified. When $M_k$ observations are
allocated to a solution $x$ at iteration $k$, we use $H_j(x)$ to denote the $j$th (independent)
random observation of $H(x,\psi)$, and use $\bar H_k(x) = \frac{1}{M_k}\sum_{j=1}^{M_k} H_j(x)$ to denote the sample
average of all $M_k$ observations made at $x$.
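A minimal sketch of the sample average $\bar H_k(x)$ follows. The noisy performance function `noisy_performance` is a made-up example (true mean $h(x) = -(x-1)^2$ with unit-variance Gaussian noise), not a model from the text.

```python
import numpy as np

def noisy_performance(x, rng):
    """Hypothetical H(x, psi): true mean h(x) = -(x - 1)^2 plus unit-variance noise."""
    return -(x - 1.0) ** 2 + rng.normal()

def sample_average(x, m_k, rng, simulate=noisy_performance):
    """Average of m_k independent observations H_1(x), ..., H_{m_k}(x)."""
    return float(np.mean([simulate(x, rng) for _ in range(m_k)]))

rng = np.random.default_rng(0)
estimate = sample_average(0.5, 10_000, rng)   # true h(0.5) = -0.25
```

Increasing `m_k` shrinks the standard error of the estimate at rate $1/\sqrt{M_k}$, which is why the allocation rule is required to grow.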

The performance of the SMRAS algorithm depends on another important sequence
of quantities $\{\rho_k,\ k = 0,1,\ldots\}$. The motivation behind the sequence is to distinguish
"good" samples from "bad" ones and to concentrate the computational effort on the set
of promising samples. The sequence $\{\rho_k\}$ is fully adaptive and works cooperatively with
the sequence $\{N_k\}$. At successive iterations of the algorithm, a sequence of thresholds
$\{\bar\gamma_k,\ k = 1,2,\ldots\}$ is generated according to the sequence of sample $(1-\rho_k)$-quantiles, and
only those samples that have performances better than these thresholds will be used in
parameter updating. Thus, each $\rho_k$ determines the approximate proportion of $N_k$ samples
that will be used to update the probabilistic model at iteration $k$.

During the initialization step of SMRAS, a small positive number $\varepsilon$ and a continuous
and strictly increasing function $S(\cdot) : \Re \to \Re^+$ are specified. The role of the parameter
$\varepsilon$, as we will see later, is to filter out the observation noise. The function $S(\cdot)$ is used
to account for the cases where the sample average approximations $\bar H_k(x)$ are negative for
some $x$.

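Stochastic Model Reference Adaptive Search (SMRAS)

• Initialization: Specify $\rho_0 \in (0,1]$, $N_0 > 1$, $\alpha > 1$, $\varepsilon > 0$, an allocation rule $\{M_k\}$, a strictly
increasing function $S(\cdot) : \Re \to \Re^+$, mixing coefficients $\{\lambda_k,\ k = 0,1,\ldots\}$ satisfying $\lambda_k \ge \lambda_{k+1}$ and
$\lambda_k \in (0,1)$ $\forall\, k$, and an initial p.d.f. $f(x,\theta_0) > 0$ $\forall\, x \in \mathcal{X}$. Set $k \leftarrow 0$.

• Repeat until a specified stopping rule is satisfied:

1. Generate $N_k$ samples $X_1^k,\ldots,X_{N_k}^k$ according to $\tilde f(\cdot,\theta_k) := (1-\lambda_k) f(\cdot,\theta_k) + \lambda_k f(\cdot,\theta_0)$.

2. Compute the sample $(1-\rho_k)$-quantile $\tilde\gamma_{k+1}(\rho_k,N_k) := \bar H_{k,(\lceil(1-\rho_k)N_k\rceil)}$, where $\lceil a\rceil$ is the
smallest integer greater than $a$, and $\bar H_{k,(i)}$ is the $i$th order statistic of the sequence
$\{\bar H_k(X_i^k),\ i = 1,\ldots,N_k\}$.

3. If $k = 0$ or $\tilde\gamma_{k+1}(\rho_k,N_k) \ge \bar\gamma_k + \varepsilon$, then do step 3a.

3a. Set $\bar\gamma_{k+1} \leftarrow \tilde\gamma_{k+1}(\rho_k,N_k)$, $\rho_{k+1} \leftarrow \rho_k$, $N_{k+1} \leftarrow N_k$, $X_{k+1}^{\dagger} \leftarrow X_{1-\rho_{k+1}}$, where
$X_{1-\rho_{k+1}} \in \{x : \bar H_k(x) = \bar H_{k,(\lceil(1-\rho_{k+1})N_k\rceil)},\ x \in \{X_1^k,\ldots,X_{N_k}^k\}\}$.

else, find the largest $\bar\rho \in (0,\rho_k)$ such that $\tilde\gamma_{k+1}(\bar\rho,N_k) \ge \bar\gamma_k + \varepsilon$.

3b. If such a $\bar\rho$ exists, then set $\bar\gamma_{k+1} \leftarrow \tilde\gamma_{k+1}(\bar\rho,N_k)$, $\rho_{k+1} \leftarrow \bar\rho$, $N_{k+1} \leftarrow N_k$, $X_{k+1}^{\dagger} \leftarrow X_{1-\bar\rho}$.

3c. else if no such $\bar\rho$ exists, set $\bar\gamma_{k+1} \leftarrow \bar H_k(X_k^{\dagger})$, $\rho_{k+1} \leftarrow \rho_k$, $N_{k+1} \leftarrow \lceil\alpha N_k\rceil$, $X_{k+1}^{\dagger} \leftarrow X_k^{\dagger}$.

endif

4. Compute $\theta_{k+1}$ as

$$\theta_{k+1} = \arg\max_{\theta\in\Theta}\ \frac{1}{N_k}\sum_{i=1}^{N_k}\frac{[S(\bar H_k(X_i^k))]^k}{\tilde f(X_i^k,\theta_k)}\,\tilde I\big(\bar H_k(X_i^k),\bar\gamma_{k+1}\big)\ln f(X_i^k,\theta), \tag{6.3}$$

where

$$\tilde I(x,\gamma) := \begin{cases} 1 & \text{if } x \ge \gamma,\\ (x-\gamma+\varepsilon)/\varepsilon & \text{if } \gamma-\varepsilon < x < \gamma,\\ 0 & \text{if } x \le \gamma-\varepsilon.\end{cases}$$

5. Set $k \leftarrow k+1$.

Figure 6.1: Stochastic Model Reference Adaptive Search

At each iteration $k$, random samples are drawn from the density/mass function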
$\tilde f(\cdot,\theta_k)$, which is a mixture of the initial density $f(\cdot,\theta_0)$ and the density calculated from
the previous iteration $f(\cdot,\theta_k)$. The initial density $f(\cdot,\theta_0)$ can be chosen according to some
prior knowledge of the problem structure; however, if nothing is known about where the
good solutions are, this density should be chosen in such a way that each region in the
solution space will have an (approximately) equal probability of being sampled. Intuitively,
mixing in the initial density enables the algorithm to explore the entire solution space and
thus maintain a global perspective during the search process.
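The mixture sampling of step 1 can be sketched as follows. The univariate normal family, the parameter layout `theta = (mean, standard deviation)`, and the numerical values are assumptions made only for this illustration.

```python
import numpy as np

def sample_mixture(n, theta_k, theta_0, lam, rng):
    """Draw n points from (1 - lam) * N(theta_k) + lam * N(theta_0)."""
    from_initial = rng.random(n) < lam            # choose a component for each sample
    mean = np.where(from_initial, theta_0[0], theta_k[0])
    std = np.where(from_initial, theta_0[1], theta_k[1])
    return rng.normal(mean, std)

def normal_pdf(x, theta):
    mean, std = theta
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, theta_k, theta_0, lam):
    """Mixture density; it appears in the denominator of the weights in step 4."""
    return (1.0 - lam) * normal_pdf(x, theta_k) + lam * normal_pdf(x, theta_0)

rng = np.random.default_rng(1)
theta_0, theta_k = (0.0, 10.0), (3.0, 0.5)        # hypothetical parameters
draws = sample_mixture(1000, theta_k, theta_0, lam=0.1, rng=rng)
```

Because the wide initial density keeps a share `lam` of the samples, regions far from the current mean are still visited occasionally.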

At step 2, the sample $(1-\rho_k)$-quantile $\tilde\gamma_{k+1}$ with respect to $\tilde f(\cdot,\theta_k)$ is calculated by
first ordering the sample performances $\bar H_k(X_i^k)$, $i = 1,\ldots,N_k$ from smallest to largest,
$\bar H_{k,(1)} \le \bar H_{k,(2)} \le \cdots \le \bar H_{k,(N_k)}$, and then taking the $\lceil(1-\rho_k)N_k\rceil$th order statistic. We
use the function $\tilde\gamma_{k+1}(\rho_k,N_k)$ to emphasize the dependencies of $\tilde\gamma_{k+1}$ on both $\rho_k$ and $N_k$,
so that different sample quantile values can be distinguished by their arguments.
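The order-statistic computation of step 2 can be sketched in a few lines. This is only an illustration: it uses `math.ceil` (smallest integer not less than its argument) and clamps the index to a valid range, both of which are implementation choices made here rather than details taken from the text.

```python
import math

def sample_quantile(perf, rho):
    """Sample (1 - rho)-quantile: the ceil((1 - rho) * N)-th order statistic."""
    ordered = sorted(perf)                       # smallest to largest
    n = len(ordered)
    idx = math.ceil((1.0 - rho) * n)             # 1-based index of the order statistic
    idx = min(max(idx, 1), n)                    # keep the index inside 1..N
    return ordered[idx - 1]
```

For example, `sample_quantile([5.0, 1.0, 4.0, 2.0], 0.5)` picks the second smallest value, so roughly the top half of the samples lies at or above the returned threshold.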

Step 3 of the algorithm is used to construct a sequence of thresholds $\{\bar\gamma_k,\ k = 1,2,\ldots\}$
from the sequence of sample quantiles $\{\tilde\gamma_k\}$, and to determine the appropriate values of
$\rho_{k+1}$ and $N_{k+1}$ to be used in subsequent iterations. This is carried out by checking
whether the condition $\tilde\gamma_{k+1}(\rho_k,N_k) \ge \bar\gamma_k + \varepsilon$ is satisfied. If the inequality holds, then
both the current $\rho_k$ value and the new sample size $N_k$ are satisfactory, and $\tilde\gamma_{k+1}(\rho_k,N_k)$
is used as the current threshold value. Otherwise, we fix the sample size $N_k$ and try to
find a smaller $\bar\rho < \rho_k$ such that the above inequality can be satisfied with the new sample
$(1-\bar\rho)$-quantile. If such a $\bar\rho$ does exist, then the current sample size $N_k$ is still deemed
acceptable, and the new threshold value is updated by the sample $(1-\bar\rho)$-quantile. On
the other hand, if no such $\bar\rho$ can be found, then the sample size $N_k$ is increased by a
factor $\alpha$, and the new threshold $\bar\gamma_{k+1}$ is calculated by using an additional variable $X_k^{\dagger}$ to
remember the particular sample that achieves the previous threshold value $\bar\gamma_k$, and then
simply allocating $M_k$ observations to $X_k^{\dagger}$. It is important to note that in step 4, the set
$\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon,\ x \in \{X_1^k,\ldots,X_{N_k}^k\}\}$ could be empty, since it could happen that
all the random samples generated at the current iteration are much worse than those
generated at the previous iteration. If this is the case, then by the definition of $\tilde I(\cdot,\cdot)$,
the right-hand side of equation (6.3) will be equal to zero, so any $\theta \in \Theta$ is a maximizer;
we define $\theta_{k+1} := \theta_k$ in this case. Note that a "soft" threshold function $\tilde I(\cdot,\cdot)$, as opposed
to the indicator function, is used in parameter updating (cf. equation (6.3)). The reason
for doing so, as will be explained later, is to smooth out the noisy observations.
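The branching logic of step 3 and the soft threshold function can be sketched as below. This is a hedged illustration, not the authors' implementation: the function names are invented here, and the search for the largest admissible $\bar\rho$ is restricted to the grid $\bar\rho = 1 - i/N_k$, which is one possible way to make that step concrete.

```python
import math

def soft_indicator(x, gamma, eps):
    """Soft threshold: 1 at or above gamma, 0 at or below gamma - eps, linear in between."""
    if x >= gamma:
        return 1.0
    if x > gamma - eps:
        return (x - gamma + eps) / eps
    return 0.0

def update_threshold(perf, gamma_bar, rho, alpha, eps, perf_at_remembered, first_iter=False):
    """One pass through step 3; returns (new threshold, new rho, new N, branch label).

    perf               -- sample averages of the N_k current candidates
    perf_at_remembered -- fresh sample average at the remembered sample (used in 3c)
    """
    ordered = sorted(perf)
    n = len(ordered)
    j = min(max(math.ceil((1.0 - rho) * n), 1), n)   # 1-based index of the quantile
    if first_iter or ordered[j - 1] >= gamma_bar + eps:
        return ordered[j - 1], rho, n, "3a"
    # A smaller rho means a higher order statistic; the first index that clears
    # the bar gives the largest admissible rho on this grid (rho = 1 - i/n).
    for i in range(j + 1, n):                        # i < n keeps the new rho positive
        if ordered[i - 1] >= gamma_bar + eps:
            return ordered[i - 1], 1.0 - i / n, n, "3b"
    # No quantile improves on the old threshold by eps: grow the sample size.
    return perf_at_remembered, rho, math.ceil(alpha * n), "3c"
```

With ten candidates scoring 1 through 10, `rho = 0.5` and `eps = 0.5`, an old threshold of 4 leads to branch 3a, an old threshold of 7 to branch 3b (new threshold 8), and an old threshold of 20 to branch 3c with the sample size raised to `ceil(alpha * 10)`.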

We now show that there is a sequence of reference models $\{\tilde g_k(\cdot)\}$ implicit in SMRAS,
and the parameter $\theta_{k+1}$ computed at step 4 indeed minimizes the KL-divergence
$\mathcal{D}(\tilde g_{k+1}, f(\cdot,\theta))$.


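Lemma 6.3.1 The parameter $\theta_{k+1}$ computed at the $k$th iteration of SMRAS minimizes
the KL-divergence $\mathcal{D}(\tilde g_{k+1}, f(\cdot,\theta))$, where

$$\tilde g_{k+1}(x) := \begin{cases} \dfrac{\big[[S(\bar H_k(x))]^k/\tilde f(x,\theta_k)\big]\,\tilde I(\bar H_k(x),\bar\gamma_{k+1})}{\sum_{i=1}^{N_k}\big[[S(\bar H_k(X_i^k))]^k/\tilde f(X_i^k,\theta_k)\big]\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})} & \text{if } \{x : \bar H_k(x) > \bar\gamma_{k+1}-\varepsilon,\ x \in \Lambda_k\} \neq \emptyset,\\[2ex] \tilde g_k(x) & \text{otherwise,}\end{cases} \tag{6.4}$$

$\forall\, k = 0,1,\ldots$, where

$$\bar\gamma_{k+1} := \begin{cases}\tilde\gamma_{k+1}(\rho_k,N_k) & \text{if step 3a is visited,}\\ \tilde\gamma_{k+1}(\bar\rho,N_k) & \text{if step 3b is visited,}\\ \bar H_k(X_k^{\dagger}) & \text{if step 3c is visited,}\end{cases}$$

and $\Lambda_k := \{X_1^k,\ldots,X_{N_k}^k\}$.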


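Proof: We only need to consider the case where $\{x : \bar H_k(x) > \bar\gamma_{k+1}-\varepsilon,\ x \in \Lambda_k\} \neq \emptyset$,
since if this is not the case, then we can always backtrack and find a $\tilde g_k(\cdot)$ with non-empty
support.

For brevity, we define $\hat S_k(\bar H_k(x)) := [S(\bar H_k(x))]^k/\tilde f(x,\theta_k)$. Note that at the $k$th iteration, the
KL-divergence between $\tilde g_{k+1}(\cdot)$ and $f(\cdot,\theta)$ can be written as

$$\begin{aligned}\mathcal{D}(\tilde g_{k+1}, f(\cdot,\theta)) &= E_{\tilde g_{k+1}}[\ln \tilde g_{k+1}(X)] - E_{\tilde g_{k+1}}[\ln f(X,\theta)]\\ &= E_{\tilde g_{k+1}}[\ln \tilde g_{k+1}(X)] - \frac{\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\ln f(X_i^k,\theta)}{\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})},\end{aligned}$$

where $X$ is a random variable with distribution $\tilde g_{k+1}(\cdot)$. Thus the proof is completed
by observing that minimizing $\mathcal{D}(\tilde g_{k+1}, f(\cdot,\theta))$ is equivalent to maximizing the quantity
$\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\ln f(X_i^k,\theta)$. ∎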

Remark 6.3.1 For optimization problems with finite solution spaces, it is often useful to
make efficient use of the past sampling information. This can be achieved by maintaining
a list of all sampled candidate solutions as well as the number of observations made at each
of these visited solutions, and then checking whether a newly generated solution is in that list. If at
the $k$th iteration, a new solution has already been visited and, say, $M_l$ observations have
been allocated, then we only need to take $M_k - M_l$ additional observations from that point.
This procedure is often effective when the solution space is relatively small. However,
when the solution space is large, the storage and checking cost could be quite expensive. In
SMRAS, we propose an alternative approach: at each iteration $k$ of the method, instead of
remembering all past samples, we only keep track of those samples that fall in the region
$\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\}$. Thus, as the search becomes more and more concentrated on
these regions, the probability of getting repeated samples will typically increase.
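The bookkeeping described at the start of the remark (a list of visited solutions with their observation counts) can be sketched as follows. The class name and the simulator are hypothetical, and the sketch covers the full-list variant for finite spaces rather than the region-restricted alternative proposed for SMRAS.

```python
class ObservationCache:
    """Remembers, for each visited solution, how many observations were taken and their sum."""

    def __init__(self):
        self._stats = {}                        # solution -> (count, running total)

    def sample_average(self, x, m_k, simulate):
        """Top up the observations at x to m_k (assumed >= 1) and return their average."""
        count, total = self._stats.get(x, (0, 0.0))
        for _ in range(max(m_k - count, 0)):    # only M_k - M_l new observations
            total += simulate(x)
            count += 1
        self._stats[x] = (count, total)
        return total / count

calls = []

def simulate(x):
    """Hypothetical simulator; records each call so the savings are visible."""
    calls.append(x)
    return float(x)

cache = ObservationCache()
cache.sample_average(3, 5, simulate)    # first visit: 5 observations
cache.sample_average(3, 8, simulate)    # revisit: only 3 more
```

Solutions must be hashable for the dictionary lookup, which is natural for finite solution spaces.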

6.4  Convergence  Analysis 

For reasons discussed in Chapter 5, we restrict our discussion to the so-called natural
exponential family (NEF) (see Definition 5.3.1), which works well in practice, and for which
convergence properties can be established.

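We make the following assumptions about the noisy observations $H_j(x)$ and the
observation allocation rule $\{M_k\}$.

Assumptions:

L1. For any given $\varepsilon > 0$, there exists a positive number $n^*$ such that for all $n \ge n^*$,

$$\sup_{x\in\mathcal{X}} P\left(\left|\frac{1}{n}\sum_{j=1}^{n} H_j(x) - h(x)\right| \ge \varepsilon\right) \le \phi(n,\varepsilon),$$

where $\phi(\cdot,\cdot)$ is strictly decreasing in its first argument and non-increasing in its
second argument. Moreover, $\phi(n,\varepsilon) \to 0$ as $n \to \infty$.

L2. For any $\varepsilon > 0$, there exist positive numbers $m^*$ and $n^*$ such that for all $m \ge m^*$ and
$n \ge n^*$,

$$\sup_{x,y\in\mathcal{X}} P\left(\left|\frac{1}{m}\sum_{j=1}^{m} H_j(x) - \frac{1}{n}\sum_{j=1}^{n} H_j(y) - h(x) + h(y)\right| \ge \varepsilon\right) \le \phi(\min\{m,n\},\varepsilon),$$

where $\phi(\cdot,\cdot)$ satisfies the conditions in L1.

L3. The observation allocation rule $\{M_k,\ k = 0,1,\ldots\}$ satisfies $M_k \ge M_{k-1}$ $\forall\, k = 1,2,\ldots$,
and $M_k \to \infty$ as $k \to \infty$. Moreover, for any $\varepsilon > 0$, there exist $\delta_\varepsilon \in (0,1)$ and $\mathcal{K}_\varepsilon > 0$
such that $\alpha^{2k}\phi(M_{k-1},\varepsilon) \le (\delta_\varepsilon)^k$, $\forall\, k \ge \mathcal{K}_\varepsilon$, where $\phi(\cdot,\cdot)$ is defined as in L1.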

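Assumption L1 is satisfied by many random sequences, e.g., the sequence of i.i.d.
random variables with (asymptotically) uniformly bounded variance, or a class of random
variables (not necessarily i.i.d.) that satisfy the large deviations principle; please refer
to [42] for further details. Assumption L2 can be viewed as a simple extension of L1.
Most random sequences that satisfy L1 will also satisfy L2. For example, consider the
particular case where the sequence $H_j(x)$, $j = 1,2,\ldots$ is i.i.d. with uniformly bounded
variance $\sigma^2(x)$ and $E(H_j(x)) = h(x)$, $\forall\, x \in \mathcal{X}$. Thus the variance of the random variable
$\frac{1}{m}\sum_{j=1}^{m} H_j(x) - \frac{1}{n}\sum_{j=1}^{n} H_j(y)$ is $\frac{1}{m}\sigma^2(x) + \frac{1}{n}\sigma^2(y)$, which is also uniformly bounded on
$\mathcal{X}$. By Chebyshev's inequality, we have for any $x, y \in \mathcal{X}$

$$\begin{aligned} P\left(\left|\frac{1}{m}\sum_{j=1}^{m} H_j(x) - \frac{1}{n}\sum_{j=1}^{n} H_j(y) - h(x) + h(y)\right| \ge \varepsilon\right) &\le \frac{\frac{1}{m}\sigma^2(x) + \frac{1}{n}\sigma^2(y)}{\varepsilon^2}\\ &\le \frac{\sup_{x,y\in\mathcal{X}}[\sigma^2(x)+\sigma^2(y)]}{\min\{m,n\}\,\varepsilon^2}\\ &= \phi(\min\{m,n\},\varepsilon).\end{aligned}$$

Assumption L3 is a regularity condition imposed on the observation allocation rule. L3
is a mild condition and is very easy to verify. For instance, if $\phi(n,\varepsilon)$ takes the form
$\phi(n,\varepsilon) = C(\varepsilon)/n$, where $C(\varepsilon)$ is a constant depending on $\varepsilon$, then the condition on $M_{k-1}$
becomes $M_{k-1} \ge C(\varepsilon)\big(\alpha^2/\delta_\varepsilon\big)^k$ $\forall\, k \ge \mathcal{K}_\varepsilon$. As another example, if $H_j(x)$, $j = 1,2,\ldots$ satisfies
the large deviations principle and $\phi(n,\varepsilon) = e^{-nC(\varepsilon)}$, then the condition becomes
$M_{k-1} \ge \big[\ln(\alpha^2/\delta_\varepsilon)/C(\varepsilon)\big]\,k$, $\forall\, k \ge \mathcal{K}_\varepsilon$.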
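As a small numerical sanity check (a sketch with made-up constants, not taken from the text), the first example can be verified directly: with $\phi(n,\varepsilon) = C(\varepsilon)/n$, choosing $M_{k-1}$ as the smallest integer at or above $C(\varepsilon)(\alpha^2/\delta_\varepsilon)^k$ makes $\alpha^{2k}\phi(M_{k-1},\varepsilon) \le (\delta_\varepsilon)^k$ hold.

```python
import math

def chebyshev_allocation(k, alpha, delta, c):
    """Smallest integer M_{k-1} with M_{k-1} >= c * (alpha^2 / delta)^k."""
    return math.ceil(c * (alpha ** 2 / delta) ** k)

def l3_lhs(m_prev, k, alpha, c):
    """alpha^(2k) * phi(M_{k-1}, eps) for phi(n, eps) = c / n."""
    return alpha ** (2 * k) * c / m_prev

alpha, delta, c = 1.04, 0.9, 2.0        # hypothetical constants
rows = [(k, chebyshev_allocation(k, alpha, delta, c)) for k in range(1, 31)]
```

The resulting allocation grows geometrically at rate $\alpha^2/\delta_\varepsilon$, which shows how quickly the number of observations per candidate must increase under a Chebyshev-type bound.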

To establish the global convergence of SMRAS, we make the following additional
assumptions.

Assumptions:

B1. There exists a compact set $\Pi$ such that for the sequence of random variables $\{X_k^{\dagger},\ k = 1,2,\ldots\}$
generated by SMRAS, $\exists\, \mathcal{N} < \infty$ w.p.1 such that $\{x : h(x) \ge h(X_k^{\dagger}) - \varepsilon\} \cap \mathcal{X} \subseteq \Pi$
$\forall\, k \ge \mathcal{N}$.

B2. For any constant $\xi < h(x^*)$, the set $\{x : h(x) \ge \xi\} \cap \mathcal{X}$ has a strictly positive Lebesgue
or discrete measure.

B3. For any given constant $\delta > 0$, $\sup_{x\in A_\delta} h(x) < h(x^*)$, where $A_\delta := \{x : \|x - x^*\| \ge \delta\} \cap \mathcal{X}$,
and we define the supremum over the empty set to be $-\infty$.

B4. For each point $z \le h(x^*)$, there exist $\Delta_k > 0$ and $L_k > 0$, such that
$\frac{|(S(z))^k - (S(\bar z))^k|}{|(S(z))^k|} \le L_k|z - \bar z|$ for all $\bar z \in (z - \Delta_k,\ z + \Delta_k)$.

B5. The maximizer of equation (6.3) is an interior point of $\Theta$ for all $k$.

B6. $\sup_{\theta\in\Theta}\|\exp\{\theta^T\Gamma(x)\}\,\Gamma(x)\,\ell(x)\|$ is integrable/summable with respect to $x$, where $\theta$,
$\Gamma(\cdot)$, and $\ell(\cdot)$ are defined in Definition 5.3.1.

B7. $f(x,\theta_0) > 0$ $\forall\, x \in \mathcal{X}$ and $f_* := \inf_{x\in\Pi} f(x,\theta_0) > 0$, where $\Pi$ is defined in B1.

As we will see, the sequence $\{X_k^{\dagger}\}$ generated by SMRAS converges (cf. the proof
of Lemma 6.4.3). Thus, B1 requires that the search of SMRAS will eventually end up in
a compact set. The assumption is trivially satisfied if the solution space $\mathcal{X}$ is compact.
Assumption B2 ensures that the neighborhood of the optimal solution $x^*$ will be sampled
with a strictly positive probability. Since $x^*$ is the unique global optimizer of $h(\cdot)$, B3 is
satisfied by many functions encountered in practice. B4 can be understood as a locally
Lipschitz condition on $[S(\cdot)]^k$; its suitability will be discussed later. In actual implementation
of the algorithm, step 4 is often posed as an unconstrained optimization problem,
i.e., $\Theta = \Re^m$, in which case B5 is automatically satisfied. It is also easy to verify that B6
and B7 are satisfied by most NEFs.

To  show  the  convergence  of  SMRAS,  we  will  need  the  following  lemmas. 

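Lemma 6.4.1 If Assumptions L1-L3 are satisfied, then step 3a/3b of SMRAS will be
visited finitely often (f.o.) w.p.1 as $k \to \infty$.

Proof: We consider the sequence $\{X_k^{\dagger},\ k = 1,2,\ldots\}$ generated by SMRAS, and let $A_k$
be the event that step 3a/3b is visited at the $k$th iteration, $B_k := \{h(X_{k+1}^{\dagger}) - h(X_k^{\dagger}) \le \frac{\varepsilon}{2}\}$,
and $\Lambda_k = \{X_1^k,\ldots,X_{N_k}^k\}$ be the set of candidate solutions generated at the $k$th iteration.
Since the event $A_k$ implies $\bar H_k(X_{k+1}^{\dagger}) - \bar H_{k-1}(X_k^{\dagger}) \ge \varepsilon$, we have

$$\begin{aligned}
P(A_k \cap B_k) &\le P\Big(\{\bar H_k(X_{k+1}^{\dagger}) - \bar H_{k-1}(X_k^{\dagger}) \ge \varepsilon\} \cap \{h(X_{k+1}^{\dagger}) - h(X_k^{\dagger}) \le \tfrac{\varepsilon}{2}\}\Big)\\
&\le P\Big(\bigcup_{x\in\Lambda_k,\, y\in\Lambda_{k-1}} \big\{\{\bar H_k(x) - \bar H_{k-1}(y) \ge \varepsilon\} \cap \{h(x) - h(y) \le \tfrac{\varepsilon}{2}\}\big\}\Big)\\
&\le \sum_{x\in\Lambda_k,\, y\in\Lambda_{k-1}} P\Big(\{\bar H_k(x) - \bar H_{k-1}(y) \ge \varepsilon\} \cap \{h(x) - h(y) \le \tfrac{\varepsilon}{2}\}\Big)\\
&\le |\Lambda_k||\Lambda_{k-1}| \sup_{x,y\in\mathcal{X}} P\Big(\{\bar H_k(x) - \bar H_{k-1}(y) \ge \varepsilon\} \cap \{h(x) - h(y) \le \tfrac{\varepsilon}{2}\}\Big)\\
&\le |\Lambda_k||\Lambda_{k-1}| \sup_{x,y\in\mathcal{X}} P\Big(\bar H_k(x) - \bar H_{k-1}(y) - h(x) + h(y) \ge \tfrac{\varepsilon}{2}\Big)\\
&\le |\Lambda_k||\Lambda_{k-1}|\,\phi\big(\min\{M_k, M_{k-1}\}, \tfrac{\varepsilon}{2}\big) \quad \text{by Assumption L2}\\
&\le N_0^2\,(\delta_{\varepsilon/2})^k \quad \forall\, k \ge \mathcal{K}_{\varepsilon/2} \ \text{by Assumption L3.}
\end{aligned}$$

Therefore,

$$\sum_{k=1}^{\infty} P(A_k \cap B_k) \le \mathcal{K}_{\varepsilon/2} + N_0^2 \sum_{k=\mathcal{K}_{\varepsilon/2}}^{\infty} (\delta_{\varepsilon/2})^k < \infty.$$

By the Borel-Cantelli lemma, we have

$$P(A_k \cap B_k \ \text{i.o.}) = 0.$$

It follows that if $A_k$ happens infinitely often, then w.p.1, $B_k^c$ will also happen infinitely
often. Thus

$$\begin{aligned}
\sum_{k=1}^{\infty}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big] &= \sum_{k:\ A_k\ \text{occurs}}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big] + \sum_{k:\ A_k^c\ \text{occurs}}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big]\\
&= \sum_{k:\ A_k\ \text{occurs}}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big] \quad \text{since } X_{k+1}^{\dagger} = X_k^{\dagger} \text{ if step 3c is visited}\\
&= \sum_{k:\ A_k\cap B_k\ \text{occurs}}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big] + \sum_{k:\ A_k\cap B_k^c\ \text{occurs}}\big[h(X_{k+1}^{\dagger}) - h(X_k^{\dagger})\big]\\
&= \infty \quad \text{w.p.1 since } \varepsilon > 0.
\end{aligned}$$

However, this is a contradiction, since $h(x)$ is bounded from above by $h(x^*)$. Therefore,
w.p.1, $A_k$ can only happen a finite number of times. ∎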

Remark 6.4.1 Lemma 6.4.1 implies that step 3c of SMRAS will be visited infinitely often
(i.o.) w.p.1.

Remark 6.4.2 Note that when the solution space $\mathcal{X}$ is finite, the set $\Lambda_k$ will be finite for
all $k$. Thus, Lemma 6.4.1 may still hold if we replace Assumption L3 by some milder
conditions on $M_k$. One such condition is $\sum_{k=1}^{\infty}\phi(M_k,\varepsilon) < \infty$, for example, when the
sequence $H_j(x)$, $j = 1,2,\ldots$ satisfies the large deviations principle and $\phi(n,\varepsilon)$ takes the
form $\phi(n,\varepsilon) = e^{-nC(\varepsilon)}$. A particular observation allocation rule that satisfies this condition
is $M_k = M_{k-1} + 1$ $\forall\, k = 1,2,\ldots$.

The following lemma relates the sequence of sampling distributions $\{f(\cdot,\theta_k),\ k = 1,2,\ldots\}$
to the sequence of reference models $\{\tilde g_k(\cdot),\ k = 1,2,\ldots\}$ (cf. (6.4)).

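Lemma 6.4.2 If assumptions B5 and B6 hold, then we have

$$E_{\theta_{k+1}}[\Gamma(X)] = E_{\tilde g_{k+1}}[\Gamma(X)], \quad \forall\, k = 0,1,\ldots,$$

where $E_{\theta_{k+1}}[\cdot]$ and $E_{\tilde g_{k+1}}[\cdot]$ are the expectations taken with respect to the p.d.f./p.m.f.
$f(\cdot,\theta_{k+1})$ and $\tilde g_{k+1}(\cdot)$, respectively.

Proof: For the same reason as discussed in the proof of Lemma 6.3.1, we only need to
consider the case where $\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon,\ x \in \{X_1^k,\ldots,X_{N_k}^k\}\} \neq \emptyset$. Define

$$J_k(\theta) := \frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\ln f(X_i^k,\theta), \quad \text{where } \hat S_k(\bar H_k(x)) := \frac{[S(\bar H_k(x))]^k}{\tilde f(x,\theta_k)}.$$

Since $f(\cdot,\theta)$ belongs to the NEF, we can write

$$\begin{aligned}
J_k(\theta) ={}& \frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\ln \ell(X_i^k)\\
&+ \frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\theta^T\Gamma(X_i^k)\\
&- \frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\ln\int_{x\in\mathcal{X}} e^{\theta^T\Gamma(x)}\ell(x)\,\nu(dx).
\end{aligned}$$

Thus the gradient of $J_k(\theta)$ with respect to $\theta$ can be expressed as

$$\nabla_\theta J_k(\theta) = \frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k) - \frac{\int e^{\theta^T\Gamma(x)}\Gamma(x)\ell(x)\,\nu(dx)}{\int e^{\theta^T\Gamma(x)}\ell(x)\,\nu(dx)}\cdot\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1}),$$

where the validity of the interchange of derivative and integral above is guaranteed by
Assumption B6 and the dominated convergence theorem. By setting $\nabla_\theta J_k(\theta) = 0$, it
follows that

$$\frac{\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k}\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})} = \frac{\int e^{\theta^T\Gamma(x)}\Gamma(x)\ell(x)\,\nu(dx)}{\int e^{\theta^T\Gamma(x)}\ell(x)\,\nu(dx)},$$

which implies that $E_{\tilde g_{k+1}}[\Gamma(X)] = E_\theta[\Gamma(X)]$ by the definitions of $\tilde g_{k+1}(\cdot)$ (cf. (6.4)) and
$f(\cdot,\theta)$.

Since $\theta_{k+1}$ is the optimal solution of the problem

$$\arg\max_{\theta\in\Theta} J_k(\theta),$$

we conclude that $E_{\theta_{k+1}}[\Gamma(X)] = E_{\tilde g_{k+1}}[\Gamma(X)]$, $\forall\, k = 0,1,\ldots$, by B5. ∎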


Remark 6.4.3 Intuitively, the sequence of regions $\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\}$, $k = 0,1,2,\ldots$
tends to get smaller and smaller during the search process of SMRAS. Lemma 6.4.2 shows
that the sequence of sampling p.d.f.'s $f(\cdot,\theta_{k+1})$ is "adapted" to this sequence of shrinking
regions. For example, consider the special case where $\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\}$ is convex
and $\Gamma(x) = x$. Since $E_{\tilde g_{k+1}}[X]$ is a convex combination of $X_1^k,\ldots,X_{N_k}^k$, the lemma implies
that $E_{\theta_{k+1}}[X] \in \{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\}$. Thus, it is natural to expect that the random
samples generated at the next iteration will fall in the region $\{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\}$ with
large probabilities (e.g., consider the normal distribution where its mean $\mu_{k+1} = E_{\theta_{k+1}}[X]$
is equal to its mode value). In contrast, if we use a fixed sampling distribution for all
iterations (cf. e.g., [74], [89]), then sampling from this sequence of shrinking regions could
be a substantially difficult problem in practice.
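For a concrete family, the moment matching in Lemma 6.4.2 gives the parameter update in closed form. The sketch below assumes a univariate normal family (an assumption for illustration; for it the sufficient statistic is $(x, x^2)$), where maximizing the weighted log-likelihood in step 4 reduces to a weighted mean and variance. The weights `w` stand in for $\hat S_k(\bar H_k(X_i^k))\,\tilde I(\bar H_k(X_i^k),\bar\gamma_{k+1})$, and the numbers are made up.

```python
import numpy as np

def normal_update(x, w):
    """Weighted maximum-likelihood fit of a normal density: returns (mean, variance)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    if w.sum() <= 0.0:
        raise ValueError("all weights are zero; keep the previous parameter instead")
    mean = float(np.sum(w * x) / np.sum(w))
    var = float(np.sum(w * (x - mean) ** 2) / np.sum(w))
    return mean, var

def weighted_loglik(x, w, mean, var):
    """Objective of step 4 for the normal family (up to the 1/N_k factor)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * (-0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var))))

x = [0.0, 1.0, 2.0, 3.0]      # hypothetical candidate solutions
w = [0.0, 1.0, 1.0, 2.0]      # hypothetical weights; the first sample is filtered out
mean, var = normal_update(x, w)
```

The fitted mean is a convex combination of the samples with positive weight, which is the mechanism the remark relies on when it argues that the next sampling distribution stays inside a convex promising region.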


We now define a sequence of (idealized) p.d.f.'s $\{g_k(\cdot)\}$ as

$$g_{k+1}(x) = \frac{[S(h(x))]^k\,\tilde I(h(x),\gamma_k)}{\int_{x\in\mathcal{X}}[S(h(x))]^k\,\tilde I(h(x),\gamma_k)\,\nu(dx)}, \quad \forall\, k = 0,1,\ldots, \tag{6.5}$$

where $\gamma_k := h(X_k^{\dagger})$. Notice that since $X_k^{\dagger}$ is a random variable, $g_{k+1}(x)$ is also random.

The outline of the convergence proof is as follows: First we establish the convergence
of the sequence of p.d.f.'s $\{g_k(\cdot)\}$; then we claim that the reference p.d.f.'s $\{\tilde g_k(\cdot)\}$ are
in fact the (sample average) approximations of the sequence $\{g_k(\cdot)\}$ by showing that
$E_{\tilde g_k}[\Gamma(X)] \to E_{g_k}[\Gamma(X)]$ w.p.1 as $k \to \infty$. Thus, the convergence of the sequence $\{f(\cdot,\theta_k)\}$
follows immediately by applying Lemma 6.4.2.

The convergence of the sequence $\{g_k(\cdot)\}$ is formalized in the following lemma.


Lemma 6.4.3 If Assumptions L1-L3, B1-B3 are satisfied, then

$$\lim_{k\to\infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*) \quad \text{w.p.1.}$$

Proof: Our proof is an extension of the proof of Theorem 5.3.1. Let $\Omega_1$ be the set of all
sample paths such that step 3a/3b is visited finitely often, and let $\Omega_2$ be the set of sample
paths such that $\lim_{k\to\infty}\{h(x) \ge \gamma_k - \varepsilon\} \subseteq \Pi$. By Lemma 6.4.1, we have $P(\Omega_1) = 1$, and
for each $\omega \in \Omega_1$, there exists a finite $\mathcal{N}(\omega) > 0$ such that

$$X_{k+1}^{\dagger}(\omega) = X_k^{\dagger}(\omega) \quad \forall\, k \ge \mathcal{N}(\omega),$$

which implies that $\gamma_{k+1}(\omega) = \gamma_k(\omega)$ $\forall\, k \ge \mathcal{N}(\omega)$. Furthermore, by B1, we have $P(\Omega_2) = 1$
and $\{h(x) \ge \gamma_k(\omega) - \varepsilon\} \subseteq \Pi$, $\forall\, k \ge \mathcal{N}(\omega)$, $\forall\, \omega \in \Omega_1 \cap \Omega_2$.

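Thus, for each $\omega \in \Omega_1 \cap \Omega_2$, it is not difficult to see from equation (6.5) that $g_{k+1}(\cdot)$
can be expressed recursively as

$$g_{k+1}(x) = \frac{S(h(x))\,g_k(x)}{E_{g_k}[S(h(X))]}, \quad \forall\, k > \mathcal{N}(\omega),$$

where we have used $g_k(\cdot)$ instead of $g_k(\omega)(\cdot)$ to simplify the notation. It follows that

$$E_{g_{k+1}}[S(h(X))] = \frac{E_{g_k}\big[(S(h(X)))^2\big]}{E_{g_k}[S(h(X))]} \ge E_{g_k}[S(h(X))], \quad \forall\, k > \mathcal{N}(\omega), \tag{6.6}$$

which implies that the sequence $\{E_{g_k}[S(h(X))],\ k = 1,2,\ldots\}$ converges (note that $E_{g_k}[S(h(X))]$
is bounded from above by $S(h(x^*))$).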

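Now we show that the limit of the above sequence is $S(h(x^*))$. To show this, we
proceed by contradiction and assume that

$$\lim_{k\to\infty} E_{g_k}[S(h(X))] = S_* < S^* := S(h(x^*)).$$

Define the set $\mathcal{C} := \{x : h(x) \ge \gamma_{\mathcal{N}(\omega)} - \varepsilon\} \cap \{x : S(h(x)) \ge \frac{S_* + S^*}{2}\} \cap \mathcal{X}$. Since $S(\cdot)$ is
strictly increasing, its inverse $S^{-1}(\cdot)$ exists, thus $\mathcal{C}$ can be formulated as
$\mathcal{C} = \{x : h(x) \ge \max\{\gamma_{\mathcal{N}(\omega)} - \varepsilon,\ S^{-1}(\frac{S_* + S^*}{2})\}\} \cap \mathcal{X}$. By B2, $\mathcal{C}$ has a strictly positive Lebesgue/discrete
measure.

Note that $g_{k+1}(\cdot)$ can be written as

$$g_{k+1}(x) = \prod_{i=\mathcal{N}(\omega)+1}^{k}\frac{S(h(x))}{E_{g_i}[S(h(X))]}\cdot g_{\mathcal{N}(\omega)+1}(x), \quad \forall\, k > \mathcal{N}(\omega).$$

Since $\lim_{k\to\infty}\frac{S(h(x))}{E_{g_k}[S(h(X))]} = \frac{S(h(x))}{S_*} > 1$, $\forall\, x \in \mathcal{C}$, we conclude that

$$\liminf_{k\to\infty} g_k(x) = \infty, \quad \forall\, x \in \mathcal{C}.$$

We have, by Fatou's lemma,

$$1 = \liminf_{k\to\infty}\int_{\mathcal{X}} g_{k+1}(x)\,\nu(dx) \ge \liminf_{k\to\infty}\int_{\mathcal{C}} g_{k+1}(x)\,\nu(dx) \ge \int_{\mathcal{C}}\liminf_{k\to\infty} g_{k+1}(x)\,\nu(dx) = \infty,$$

which is a contradiction. Hence, it follows that

$$\lim_{k\to\infty} E_{g_k}[S(h(X))] = S^*, \quad \forall\, \omega \in \Omega_1 \cap \Omega_2. \tag{6.7}$$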

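We now bound the difference between $E_{g_{k+1}}[\Gamma(X)]$ and $\Gamma(x^*)$. We have

$$\begin{aligned}\|E_{g_{k+1}}[\Gamma(X)] - \Gamma(x^*)\| &\le \int_{x\in\mathcal{X}}\|\Gamma(x) - \Gamma(x^*)\|\,g_{k+1}(x)\,\nu(dx)\\ &= \int_{\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\,g_{k+1}(x)\,\nu(dx),\end{aligned} \tag{6.8}$$

where $\mathcal{D} := \{x : h(x) \ge \gamma_{\mathcal{N}(\omega)} - \varepsilon\} \cap \mathcal{X}$ is the support of $g_{k+1}(x)$, $\forall\, k > \mathcal{N}(\omega)$.

By the assumption on $\Gamma(\cdot)$ in Definition 5.3.1, for any given $\zeta > 0$, there exists a
$\delta > 0$ such that $\|x - x^*\| \le \delta$ implies $\|\Gamma(x) - \Gamma(x^*)\| \le \zeta$. Let $A_\delta$ be defined as in B3;
then we have from (6.8)

$$\begin{aligned}\|E_{g_{k+1}}[\Gamma(X)] - \Gamma(x^*)\| &\le \int_{A_\delta^c\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\,g_{k+1}(x)\,\nu(dx) + \int_{A_\delta\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\,g_{k+1}(x)\,\nu(dx)\\ &\le \zeta + \int_{A_\delta\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\,g_{k+1}(x)\,\nu(dx), \quad \forall\, k > \mathcal{N}(\omega).\end{aligned} \tag{6.9}$$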

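The rest of the proof amounts to showing that the second term in (6.9) is also bounded.
Clearly by B1, the term $\|\Gamma(x) - \Gamma(x^*)\|$ is bounded on the set $A_\delta \cap \mathcal{D}$. We only need to
find a bound for $g_{k+1}(x)$.

By B3, we have

$$\sup_{x\in A_\delta\cap\mathcal{D}} h(x) \le \sup_{x\in A_\delta} h(x) < h(x^*).$$

Define $S_\delta := S^* - S(\sup_{x\in A_\delta} h(x))$. And by the monotonicity of $S(\cdot)$, we have $S_\delta > 0$. It
is easy to see that

$$S(h(x)) \le S^* - S_\delta, \quad \forall\, x \in A_\delta \cap \mathcal{D}. \tag{6.10}$$

From (6.6) and (6.7), there exists $\bar{\mathcal{N}}(\omega) \ge \mathcal{N}(\omega)$ such that for all $k \ge \bar{\mathcal{N}}(\omega)$

$$E_{g_k}[S(h(X))] \ge S^* - \tfrac{1}{2}S_\delta. \tag{6.11}$$

Observe that $g_{k+1}(x)$ can be rewritten as

$$g_{k+1}(x) = \prod_{i=\bar{\mathcal{N}}}^{k}\frac{S(h(x))}{E_{g_i}[S(h(X))]}\cdot g_{\bar{\mathcal{N}}}(x).$$

Thus, it follows from (6.10) and (6.11) that

$$g_{k+1}(x) \le \left(\frac{S^* - S_\delta}{S^* - \frac{1}{2}S_\delta}\right)^{k-\bar{\mathcal{N}}+1} g_{\bar{\mathcal{N}}}(x), \quad \forall\, x \in A_\delta \cap \mathcal{D},\ \forall\, k \ge \bar{\mathcal{N}}(\omega).$$

Therefore,

$$\begin{aligned}\|E_{g_{k+1}}[\Gamma(X)] - \Gamma(x^*)\| &\le \zeta + \sup_{x\in A_\delta\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\int_{A_\delta\cap\mathcal{D}} g_{k+1}(x)\,\nu(dx)\\ &\le \zeta + \sup_{x\in A_\delta\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\left(\frac{S^* - S_\delta}{S^* - \frac{1}{2}S_\delta}\right)^{k-\bar{\mathcal{N}}+1}, \quad \forall\, k \ge \bar{\mathcal{N}}(\omega)\\ &\le \Big(1 + \sup_{x\in A_\delta\cap\mathcal{D}}\|\Gamma(x) - \Gamma(x^*)\|\Big)\zeta, \quad \forall\, k \ge \hat{\mathcal{N}}(\omega),\end{aligned}$$

where $\hat{\mathcal{N}}(\omega)$ is given by $\hat{\mathcal{N}}(\omega) := \max\Big\{\bar{\mathcal{N}}(\omega),\ \Big\lceil\bar{\mathcal{N}}(\omega) - 1 + \ln\zeta\Big/\ln\Big(\frac{S^* - S_\delta}{S^* - \frac{1}{2}S_\delta}\Big)\Big\rceil\Big\}$.

Since $\zeta$ is arbitrary, we have

$$\lim_{k\to\infty} E_{g_k}[\Gamma(X)] = \Gamma(x^*), \quad \forall\, \omega \in \Omega_1 \cap \Omega_2.$$

And since $P(\Omega_1 \cap \Omega_2) = 1$, the proof is thus completed. ∎

As mentioned earlier, the rest of the convergence proof now amounts to showing
that $E_{\tilde g_k}[\Gamma(X)] \to E_{g_k}[\Gamma(X)]$ w.p.1 as $k \to \infty$. However, there is one more complication:
Since $S(\cdot)$ is an increasing function and is raised to the $k$th power in both $\tilde g_{k+1}$ and $g_{k+1}$
(cf. (6.4), (6.5)), the associated estimation error between $\bar H_k(x)$ and $h(x)$ is exaggerated.
Thus, even though we have $\lim_{k\to\infty}\bar H_k(x) = h(x)$ w.p.1, the quantities $S^k(\bar H_k(x))$ and
$S^k(h(x))$ may still differ considerably as $k$ gets large. Therefore, the sequence $\{\bar H_k(x)\}$
not only has to converge to $h(x)$, but it should also do so at a fast enough rate in order
to reduce the gap between $S^k(\bar H_k(x))$ and $S^k(h(x))$. This requirement is summarized in
the following assumption.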

Assumption L4. For any given $\zeta > 0$, there exist $\delta^* \in (0,1)$ and $\mathcal{K} > 0$ such that the
observation allocation rule $\{M_k,\ k = 1,2,\ldots\}$ satisfies

aV(Mfc,min{Afc,^,^})  <{5*f^k>X, 

where $\phi(\cdot,\cdot)$ is defined as in L1, and $\Delta_k$ and $L_k$ are defined as in B4.

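Let $S(z) = e^{\tau z}$ for some positive constant $\tau$. We have $S^k(z) = e^{\tau k z}$ and $[S^k(z)]' =
k\tau e^{\tau k z}$. It is easy to verify that $\frac{|S^k(z) - S^k(\bar z)|}{S^k(z)} \le k\tau e^{\tau k\Delta_k}|z - \bar z|$ $\forall\, \bar z \in (z - \Delta_k,\ z + \Delta_k)$,
and B4 is satisfied for $\Delta_k = 1/k$ and $L_k = \tau e^{\tau} k$. Thus, the condition in L4 becomes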
a^(l){Mk,  C/a^k)  <  {S*)^  V  A:  >  /C,  where  C  =  Clxe'^.  We  consider  the  following  two  special 
cases of L4. Let $H_i(x)$ be i.i.d. with $E(H_i(x)) = h(x)$ and uniformly bounded variance
suPajeA” ^  By  Chebyshev’s  inequality 

/  —  C  \ 

p{\H,(x)-h{x)\>X)<^^. 

Thus,  it  is  easy  to  check  that  L4  is  satisfied  by  Mk  =  for  any  constant  ^  >  1. 

As a second example, consider the case where $H_1(x),\ldots,H_{M_k}(x)$ are i.i.d. with
$E(H_i(x)) = h(x)$ and bounded support $[a,b]$. By the Hoeffding inequality ([40])

P{\H,{X)  -  Mx)|  >  Jj)  <  2 exp 

In  this  case,  L4  is  satisfied  by  Mk  =  for  any  constant  fx  >  1. 
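The amplification effect that motivates Assumption L4 can be illustrated numerically. For the exponential choice $S(z) = e^{\tau z}$, the ratio $S^k(h + e)/S^k(h)$ equals $e^{\tau k e}$, so an estimation error $e$ that does not shrink faster than $1/k$ is blown up. The function name and the error sizes below are hypothetical.

```python
import math

def amplification(k, err, tau=1.0):
    """Ratio S^k(h + err) / S^k(h) for S(z) = exp(tau * z); it does not depend on h."""
    return math.exp(tau * k * err)

k = 100
slow = amplification(k, 1.0 / math.sqrt(k))   # error of order k^(-1/2): ratio e^10
fast = amplification(k, 1.0 / k ** 2)         # error of order k^(-2): ratio close to 1
```

With an error of order $k^{-1/2}$ the two quantities differ by a factor above $10^4$ at $k = 100$, while an error of order $k^{-2}$ leaves them within about one percent of each other.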

Again,  as  discussed  in  Remark  6.4.2,  Assumption  L4  can  be  replaced  by  the  weaker 
condition 

oo  .  ^ 

J^,^(Mfc,min{Afc,^,^})  <  oo 
k=l  ^ 

when  the  solution  space  X  is  discrete  finite. 

Proposition 6.4.1 If Assumptions L1-L4 are satisfied, then

lim  a^|7fc+i  -  7A:|  =  0  w.p.l. 

fc— >oo 

Proof: Again, we consider the sequence $\{X_k^{\dagger}\}$ generated by SMRAS.

We have for any $\zeta > 0$

p(|7M-7<=+i|>^)  =  -  m4+.)I  >  ^) 

<  r(  u  A}) 

xsAj, 

xeAfc 

<  |Afc|  supP('|.Rfc(x)  - /r(x)|  > 

<  by  LI 

<  Nq{6*)^  y  k  >  JC  hy  la  and  the  definition  of  (/>(•,  •). 

Thus 

OO  .  OO 

J]p(|7fc+i-7fc+i|  >  <ic  +  NoY,in^  <^- 

k=l  k=K: 
And by the Borel-Cantelli lemma,


^({|7fc+i-7fc+i)|  >  i-o)  =0. 

Let  be  defined  as  before,  and  define  Os  :=  {w  :  |7fc+i  —  7fc+i|  >  C  Lo.}.  Since  for 
each  (jj  £  Oi,  there  exists  a  finite  J\f{u>)  >  0  such  that  'yk+i{uj)  =  7fc(ti;)  V /c  >  we 

have 

P(^Q;^|7fc+i  -  7fc|  >  C  i-o^ 

=  p(j%+i  -  7fc|  >  ^  i-o.  n  +  p(^|7fc+i  -  7fc|  >  ^  i-o-  n 

<  ^(OsnOi) +  P(0^) 

=  0. 

And since $\zeta$ is arbitrary, the proof is thus completed. ∎

We  are  now  ready  to  state  the  main  theorem. 

Theorem 6.4.1 Let $\varphi > 0$ be a positive constant satisfying the condition that the set
$\{x : S(h(x)) \ge \frac{1}{\varphi}\}$ has a strictly positive Lebesgue/counting measure. If assumptions
L1-L4, B1-B7 are satisfied, and there exist $\delta \in (0,1)$ and $T_\delta < \infty$ such that $\alpha \ge$
5]  yk>rs,  then 

$$\lim_{k\to\infty} E_{\theta_k}[\Gamma(X)] = \Gamma(x^*) \quad \text{w.p.1,} \tag{6.12}$$

where the limit above is component-wise.

Remark 6.4.4 By the monotonicity of $S(\cdot)$ and Assumption B2, it is easy to see that such
a positive constant $\varphi$ in Theorem 6.4.1 always exists. Moreover, for continuous problems,
$\varphi$ can be chosen such that $\varphi S^* \approx 1$; for discrete problems, if the counting measure is used,
then we can choose $\varphi = 1/S^*$.

Remark 6.4.5 Note that when $\Gamma(x)$ is a one-to-one function, the above result can be
equivalently written as $\Gamma^{-1}\big(\lim_{k\to\infty} E_{\theta_k}[\Gamma(X)]\big) = x^*$. Also note that for some particular
p.d.f.'s/p.m.f.'s, the solution vector $x$ itself will be a component of $\Gamma(x)$ (e.g., multivariate
normal p.d.f.). Under these circumstances, we can disregard the redundant components
and interpret (6.12) as $\lim_{k\to\infty} E_{\theta_k}[X] = x^*$. Another special case of particular interest is
when the components of the random vector $X = (X_1, \ldots, X_n)$ are independent, and each
has a univariate p.d.f./p.m.f. of the form

\[ f(x_i, \vartheta_i) = \exp\big(x_i \vartheta_i - K(\vartheta_i)\big)\,\ell(x_i), \quad \vartheta_i \subset \Re, \ \forall\, i = 1, \ldots, n. \]

In this case, since the distribution of the random vector $X$ is simply the product of the
marginal distributions, we have $\Gamma(x) = x$. Thus, equation (6.12) is again equivalent to
$\lim_{k\to\infty} E_{\theta_k}[X] = x^*$, where $\theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k)$, and $\vartheta_i^k$ is the value of $\vartheta_i$ at the $k$th
iteration of the algorithm. The above observations indicate that the convergence result in
Theorem 6.4.1 is much stronger than it appears to be.


Proof: For brevity, we define the function $Y_k(Z,\gamma) := S_k(Z)\, I(Z,\gamma)$, where

\[ S_k(Z) = \begin{cases} [S(h(x))]^k / f(x,\theta_k) & \text{if } Z = h(x), \\ [S(\bar H_k(x))]^k / f(x,\theta_k) & \text{if } Z = \bar H_k(x). \end{cases} \]


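By B7, the support of $f(\cdot,\theta_k)$ satisfies $\mathcal{X} \subseteq \operatorname{supp}\{f(\cdot,\theta_k)\}$ $\forall\, k$. Thus, we can write

\[ E_{g_{k+1}}[\Gamma(X)] = \frac{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)\,\Gamma(X)]}{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)]}, \]

where $E_{\theta_k}(\cdot)$ is the expectation taken with respect to $f(\cdot,\theta_k)$. We now show $E_{\theta_{k+1}}[\Gamma(X)] \to
E_{g_{k+1}}[\Gamma(X)]$ w.p.1 as $k \to \infty$. Since we are only interested in the limiting behavior of
$E_{\theta_{k+1}}[\Gamma(X)]$, from the definition of $\tilde g_{k+1}(\cdot)$ (cf. (6.4)), it is sufficient to show that

\[ \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})} \to E_{g_{k+1}}[\Gamma(X)] \quad \text{w.p.1,} \]

where and hereafter, whenever $\Lambda_k \cap \{x : \bar H_k(x) > \bar\gamma_{k+1} - \varepsilon\} = \emptyset$, we define $0/0 = 0$. We
have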


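\[
\begin{aligned}
&\frac{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})}
 - \frac{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)\,\Gamma(X)]}{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)]} \\
&\quad = \underbrace{\left\{ \frac{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k+1})}
 - \frac{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k})} \right\}}_{[i]} \\
&\qquad + \underbrace{\left\{ \frac{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} Y_k(\bar H_k(X_i^k),\bar\gamma_{k})}
 - \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(h(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(h(X_i^k),\bar\gamma_{k})} \right\}}_{[ii]} \\
&\qquad + \underbrace{\left\{ \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(h(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k)}{\frac{1}{N_k}\sum_{i=1}^{N_k} Y_k(h(X_i^k),\bar\gamma_{k})}
 - \frac{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)\,\Gamma(X)]}{E_{\theta_k}[Y_k(h(X),\bar\gamma_k)]} \right\}}_{[iii]}.
\end{aligned}
\]

We now analyze the terms $[i]$-$[iii]$.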


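(1). We define $\mathcal{E}_k := \{x : \bar H_k(x) > \min(\bar\gamma_{k+1},\bar\gamma_k) - \varepsilon,\ x \in \Lambda_k\}$. Note that if $\mathcal{E}_k = \emptyset$, then
$[i] = 0$ by convention. When $\mathcal{E}_k \ne \emptyset$, we let $\eta_k := 1/\max_{x \in \mathcal{E}_k} S_k(\bar H_k(x))$. Thus

\[ [i] = \frac{\sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})}
 - \frac{\sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k)}{\sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k})}. \]

We have

\[
\begin{aligned}
&\Big| \sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1}) - \sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k}) \Big| \\
&\quad \le \sum_{i=1}^{N_k} \big| I(\bar H_k(X_i^k),\bar\gamma_{k+1}) - I(\bar H_k(X_i^k),\bar\gamma_{k}) \big| \quad \text{since } \eta_k S_k(\bar H_k(x)) \le 1 \ \forall\, x \in \mathcal{E}_k \\
&\quad \le \alpha^k N_0 \frac{|\bar\gamma_{k+1} - \bar\gamma_k|}{\varepsilon} \quad \text{by the definition of } I(\cdot,\cdot) \\
&\quad \to 0 \ \text{w.p.1 by Proposition 6.4.1.}
\end{aligned}
\]

A similar argument can also be used to show that w.p.1

\[ \Big| \sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})\,\Gamma(X_i^k) - \sum_{i=1}^{N_k} \eta_k S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k})\,\Gamma(X_i^k) \Big| \to 0. \]

Therefore, $[i] \to 0$ as $k \to \infty$ w.p.1.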


(2).  Define  £k  :=  {x  :  h{x)  >  7^  -  £,  x  G  A^}  U  {x  :  Hk{x)  >7^-6,  x  G  A*.}.  If  £k  =  0, 
then  [ii]  =  0  by  convention.  If  Bk  /  0,  we  let  rik:=ll  Sk{h{x)),  thus 

r . .  ^  EfJi  r?fc5fc(gfc(Xf))J(ijfc(Xf),7fc)r(Xf)  _  E&  ))7(/rpff ),  7fc)r(Xf ) 

VkSk{h{X^))mX^),lk) 

And  it  is  not  difficult  to  see  that  we  will  have  either  VkSkiHkiXj^)) I {Hk{X^) ,  ^k)  >  1 

or  J2^=i  VkSk{h{X^))I{h{X^)  ,’^k)  >  1  or  both.  Therefore,  in  order  to  prove  that  [ii]  0 
w.p.l,  it  is  sufficient  to  show  that  w.p.l 


\T!^AVkSk{mxm{mxf),lk)-Y.'^^VkSk{h{Xf))I{h{Xf),^k)\^^  and 

-  T.'^^^kSk{Kx^))mx^),ik)T{xf)\ ^  o. 


We  have 


< 


Nk 


Nk 


Y,VkSk{Hk{X^))I{Hk{X^),jk)  -Y,VkSk{KXf))I{KXf),lk) 


Nk 


Nk 


Y,VkSk{Hk{Xf))I{Hk{Xf),^k)  -Y,VkSk{KXf))I{Hk{Xf),^k) 


+\^VkSk{Kx^))i{Hk{x^),7k) -^vkSk{Kx^))i{Kx^),ik) 

i=l  i=\ 


[b] 


< 


Nk 

E 

i=l 

Nk 

E 

i=l 


\Sk{Hk{X^))  -  Sk{h{X^))\ 

Sk{h{X^)) 

\[S{Hk{Xm’^-[S{h{Xm' 

[S{h{X^)r 


I{Hk{X^),^k) 


'-I{Hk{X^),^k). 


(6.13) 
Note that


P 


(  max 
\l<i<Nk 


mxt)  -  h{xf) 


<  ^(  U  {  ^ 

XSAfe 

<  y~]  P  {\Hk{x)  -  /i(x)|  >  Afc)  , 

xsAfc 

<  |Afc|  supP  - /i(a:)|  >  Afc)  , 

<  a^Nocl){Mk,Ak)  by  LI, 

<  No{5*)^  yk>IC  byL4. 


Furthermore, 


Ep 


fc=i 


(  max 
\l<i<Nk 


H,{X^)-h{Xf) 


<X  +  No^{5*f  <oo, 
k=K 


which  implies  that  P  (maxi<i<7Vj,  \Hk{X^)  —  h{X^)\  >  i.o.)  =  0  by  the  Borel-Cantelli 

lemma. 


Let  O4  :=  {u)  :  maxi<j<7Vfe  \Hk{Xf)  —  h{Xf)\  <  i.o.}.  For  each  oj  G  O4,  we  have 

Nk 

(6.13)  <  Lfc|iLfc(A(')  —  h{X^)\  for  sufficiently  large  /c,  by  P4, 

i=l 

<  a^NoLk  max  \Hk{X^)  —  h{X^)\  for  sufficiently  large /c. 


Notice  that  for  any  given  C  >  0, 

p{a"L,^maxJPfc(Xf)-MA^)|>C}<^’(  U  {\Hk{x)  -  h{x)\  >  . 

xsAj, 

And  by  using  L4  and  a  similar  argument  as  in  the  proof  for  Proposition  6.4.1,  it  is  easy 
to  show  that 

a^Lfc  ^max  \Hk{X^)  —  h{X^)\  ^  0  w.p.l 

Let  Os  :=  {uJ  :  a^Lfc maxi<i<7v,  \Hk{X^)  -  h{X^)\  ^  O}.  Since  P(04nLi5)  >  1-P{ni)- 
P(n|)  =  1,  it  follows  that  [a]  ^  0  as  /c  ^  00  w.p.l. 
On the other hand


[b] 


< 


< 


Nk 


I{Hk{X^),^k)-I{h{X^),lk) 


2=1 

a^No-  max  \Hk{X^)  -  h{xN 
0  w.p.l  by  a  similar  argument  as  before. 


By  repeating  the  above  argument,  we  can  also  show  that 


Nk 


Nk 


Y,vkSk{mx^mMxf),ik)nxf)-Y.vkSk{Kxf))mxf),^^)T{x^ 


0  w.p.l. 


Hence,  we  have  [ii\  ^  0  as  A:  ^  oo  w.p.l. 


(3). 


m  EfJi  p^Sk{h{xt))mx^),7k)r{xt)  Eg^ 

ip^Sk{h{x))i{h{x),jk)r{x) 

Wk  i:Z\  ^^Mh(x^^))i(h(x^^),jk) 

Sfe 

'p^Sk{h{X))I{h{X),jk) 

Since  e  >  0,  we  have  7^— e  <  h{x*)—£  for  all  k.  Thus  by  B2,  the  set  {x  :  h{x)  >  7^— elndf 
has  a  strictly  positive  Lebesgue/discrete  measure  for  all  k.  It  follows  from  Fatou’s  lemma 
that 


lim  inf  Eg, 


/C— >CXD 


ip^Sk{h{X))I{h{X),jk)  >  /  limmf[(fS{h{x))fI{h{x),jk)i^{dx)>0, 

J  fc^OO 


where  the  last  inequality  follows  from  ipS{h{x))  >  1  Vx  G  {x  :  h{x)  >  max{5  ^(^),  h{x* 

£}}. 


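We denote by $\mathcal{U}_k$ the event that the total number of visits to step 3a/3b is less
than or equal to $\sqrt{k}$ at the $k$th iteration of the algorithm, and by $\mathcal{V}_k$ the event that
$\{h(x) > \bar\gamma_k - \varepsilon\} \subseteq \Pi$. And for any $\zeta > 0$, let $\mathcal{C}_k$ be the event

\[ \Big| \frac{1}{N_k}\sum_{i=1}^{N_k} \varphi^k S_k(h(X_i^k))\, I(h(X_i^k),\bar\gamma_k) - E_{\theta_k}\big[\varphi^k S_k(h(X))\, I(h(X),\bar\gamma_k)\big] \Big| \ge \zeta. \]

Note that we have $P(\mathcal{U}_k^c \text{ i.o.}) = 0$ by Lemma 6.4.1, and $P(\mathcal{V}_k^c \text{ i.o.}) = 0$ by B1. Therefore,

\[
\begin{aligned}
P(\mathcal{C}_k \text{ i.o.}) &= P\big(\{\mathcal{C}_k \cap \mathcal{U}_k\} \cup \{\mathcal{C}_k \cap \mathcal{U}_k^c\} \text{ i.o.}\big) \\
&= P(\mathcal{C}_k \cap \mathcal{U}_k \text{ i.o.}) \\
&= P\big(\{\mathcal{C}_k \cap \mathcal{U}_k \cap \mathcal{V}_k\} \cup \{\mathcal{C}_k \cap \mathcal{U}_k \cap \mathcal{V}_k^c\} \text{ i.o.}\big) \\
&= P(\mathcal{C}_k \cap \mathcal{U}_k \cap \mathcal{V}_k \text{ i.o.}). 
\end{aligned}
\tag{6.14}
\]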


From  B7,  it  is  easy  to  see  that  conditional  on  the  event  Vk,  the  support  [ak,bk]  of  the 
random  variable  ))/(/i(Xf), 7fc)  satisfies  [0^,6^]  ^  Moreover,  condi¬ 
tional  on  Ok  and  7^,,  . . . ,  are  i.i.d.  random  variables  with  common  density  /(•,  Ok), 

we  have  by  the  Hoeffding  inequality. 


P(4|V>„4  =  9,7.  =  7)  <  2exp(-4^) 


<  2  exp 


-2Nke>if^ 


kJ* 


ivS" 


*\2k 


)  V  A:  =  1,2,.... 


Thus, 


PiCkHVk)  =  [  P{CknVk\Ok  =  0,-/k  =  7)fe„^,{M,d-f) 

J6»,7 

P  {Ck\yk,dk  =  0,'yk  =  7)  fek,'rMd,d'y) 


'eVk 


<  2  exp 


-2Nke^if^ 


kJ* 


where  •)  is  the  joint  distribution  of  random  variables  Ok  and  7^,.  It  follows  that 

P  (CknUkCiVk)  <  P  {CkCiVklldk) 

^  V  ) 

^  ,  (-2NoefU  \  ^\ 


where  the  second  inequality  above  follows  from  the  fact  that  conditional  on  lAk,  the  total 
number  of  visits  to  step  3c  is  greater  than  k  —  Vk. 
Moreover, since $e^{-x} < 1/x$ $\forall\, x > 0$, we have
PiCknUkHVk)  <  ,,,  ^o.of  o/i  )  ^ 


By  assumption,  we  have  ^/i  <  (^  <  1  for  all  k  >  Tg.  Thus,  there  exist  5  <  5  <  1  and 
7^  >  0  such  that  k>Tj.  Therefore, 


k  1 

1  —  I 

Vfc/fc(^5*)2 

Noeff ' 

^  aXT  ^ 

/c^— 1  k — '  1  ^ 

Thus,  we  have  by  the  Borel-Cantelli  lemma 

p {Ck  nUkCiVk  i-o.)  =  0, 

which  implies  that  P{Ck  i-o.)  =  0  by  (6.14).  And  since  ^  >  0  is  arbitrary,  we  have 
Nk 

—  ^^^Sk{h{Xf))I{h{Xf),-ik)-Eg^  [p^\{h{X))I{h{X),-ik)\  I  ^  0  w.p.l.  as  ^  oo. 
2  =  1 

The  same  argument  can  also  be  used  to  show  that 
Affe  _ 

2  =  1 

And  because  liminffc^oo  [</^^*S'a:(^(7^))/(/i(X),  7^)]  >  0,  we  have  [iii\  0  w.p.l  as 
k  00. 

Hence the proof is completed by applying Lemmas 6.4.2 and 6.4.3. ∎

We  now  address  some  of  the  special  cases  discussed  in  Remark  6.4.5;  the  proofs  are 
straightforward  and  hence  omitted. 

Corollary 6.4.2 (Multivariate Normal) For continuous optimization problems in $\Re^n$,
if multivariate normal p.d.f.'s are used in SMRAS, i.e.,

\[ f(x,\theta_k) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_k|}} \exp\Big(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\Big), \]

where $\theta_k := (\mu_k; \Sigma_k)$, assumptions L1-L4, B1-B5 are satisfied, and there exist $\delta \in (0,1)$
and $T_\delta < \infty$ such that $\alpha \ge [\varphi S^*]^2 / [\lambda_k^{2/k}\,\delta]$ $\forall\, k \ge T_\delta$, then

\[ \lim_{k\to\infty} \mu_k = x^*, \quad \text{and} \quad \lim_{k\to\infty} \Sigma_k = 0_{n\times n} \quad \text{w.p.1,} \]

where $0_{n\times n}$ represents an n-by-n zero matrix.


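Corollary 6.4.3 (Independent Univariate) If the components of the random vector
$X = (X_1, \ldots, X_n)$ are independent, each has a univariate p.d.f./p.m.f. of the form

\[ f(x_i, \vartheta_i) = \exp\big(x_i \vartheta_i - K(\vartheta_i)\big)\,\ell(x_i), \quad \vartheta_i \subset \Re, \ \forall\, i = 1, \ldots, n, \]

assumptions L1-L4, B1-B7 are satisfied, and there exist $\delta \in (0,1)$ and $T_\delta < \infty$ such
that $\alpha \ge [\varphi S^*]^2 / [\lambda_k^{2/k}\,\delta]$ $\forall\, k \ge T_\delta$, then

\[ \lim_{k\to\infty} E_{\theta_k}[X] = x^* \quad \text{w.p.1, where } \theta_k := (\vartheta_1^k, \ldots, \vartheta_n^k). \]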

Remark 6.4.6 (Stopping Rule): We now return to the issue of designing a valid stopping
rule for SMRAS. In practice, this can be achieved in many different ways. The
simplest method is to stop the algorithm when the total computational budget is exhausted
or when the prescribed maximum number of iterations is reached. Since Proposition 6.4.1
indicates that the sequence $\{\bar\gamma_k, k = 0, 1, \ldots\}$ generated by SMRAS converges, an alternative
stopping criterion could be based on identifying whether the sequence has settled down
to its limit value. To do so, we consider the moving average process $\{\Upsilon_k^{(l)}\}$ defined as
follows

\[ \Upsilon_k^{(l)} := \frac{1}{l} \sum_{i=k-l+1}^{k} \bar\gamma_i, \quad \forall\, k = l-1, l, \ldots, \]

where $l > 1$ is a predefined constant. It is easy to see that an unbiased estimator of the
sample variance of $\Upsilon_k^{(l)}$ is

\[ \widehat{\mathrm{var}}(\Upsilon_k^{(l)}) := \frac{\sum_{i=k-l+1}^{k} \big[\bar\gamma_i - \Upsilon_k^{(l)}\big]^2}{l(l-1)}, \]

which approaches zero as the sequence $\{\bar\gamma_k\}$ approaches its limit. Thus, a reasonable
approach in practice is to stop the algorithm when the value of $\widehat{\mathrm{var}}(\Upsilon_k^{(l)})$ falls below some
pre-specified tolerance level, i.e., $\exists\, k > 0$ such that $\widehat{\mathrm{var}}(\Upsilon_k^{(l)}) \le \tau$, where $\tau > 0$ is the
tolerance level.
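
A moving-average stopping test of this kind can be sketched in a few lines of Python. The sketch assumes the window statistic is the mean of the last $l$ thresholds and the variance estimate is the sum of squared deviations divided by $l(l-1)$; the function and argument names are hypothetical.

```python
def moving_average_variance(gammas, l):
    """Return (moving average, variance estimate) over the last l thresholds.

    gammas : list of thresholds gamma_0, ..., gamma_k recorded so far
    l      : window length, must be at least 2
    """
    if l < 2 or len(gammas) < l:
        raise ValueError("need l >= 2 and at least l recorded thresholds")
    window = gammas[-l:]
    avg = sum(window) / l
    # Sum of squared deviations divided by l * (l - 1).
    var = sum((g - avg) ** 2 for g in window) / (l * (l - 1))
    return avg, var


def should_stop(gammas, l=10, tau=0.005):
    """True once the variance estimate of the window drops to tau or below."""
    if len(gammas) < l:
        return False  # not enough history yet
    _, var = moving_average_variance(gammas, l)
    return var <= tau
```

In a run of the algorithm one would append the current threshold to `gammas` after every iteration and call `should_stop` before starting the next one.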

6.5  Numerical  Examples 

In this Chapter, we test the performance of SMRAS on both continuous and combinatorial
stochastic optimization problems. In the former case, we first illustrate the global
convergence of SMRAS by testing the algorithm on two multi-extremal functions; then we
apply the algorithm to an inventory control problem. In the latter case, we consider the
problem of optimizing the buffer allocations in a tandem queue with unreliable servers,
which has been previously studied in, e.g., [3], [84].

We  now  discuss  some  implementation  issues  of  SMRAS. 

1. Since SMRAS was presented in a maximization context, the following slight modifications
are required before it can be applied to minimization problems: (i) $S(\cdot)$
needs to be initialized as a strictly decreasing function instead of strictly increasing.
Throughout this Chapter, we take $S(z) := \beta^{z}$ for maximization problems and
$S(z) := \beta^{-z}$ for minimization problems, where $\beta > 1$ is some predefined constant.
(ii) The sample $(1 - \rho_k)$-quantile $\tilde\gamma_{k+1}$ will now be calculated by first ordering the
sample performances $\bar H_k(X_i^k)$, $i = 1, \ldots, N_k$ from largest to smallest, and then taking
the $\lceil (1 - \rho_k) N_k \rceil$th order statistic. (iii) The threshold function should now be
modified as

\[ I(x,\gamma) := \begin{cases} 0 & \text{if } x \ge \gamma + \varepsilon, \\ (\gamma + \varepsilon - x)/\varepsilon & \text{if } \gamma < x < \gamma + \varepsilon, \\ 1 & \text{if } x \le \gamma. \end{cases} \]

(iv) The inequalities at the beginning of steps 3 and 3b need to be replaced with
$\tilde\gamma_{k+1}(\rho_k, N_k) \le \bar\gamma_k - \varepsilon$ and $\tilde\gamma_{k+1}(\bar\rho, N_k) \le \bar\gamma_k - \varepsilon$, respectively.
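
The minimization variants of the threshold function and of the sample quantile described in items (ii) and (iii) can be illustrated with the short Python sketch below. It assumes a piecewise-linear threshold that equals 1 at or below $\gamma$, 0 at or above $\gamma + \varepsilon$, and is linear in between; all names are hypothetical.

```python
import math


def threshold_min(x, gamma, eps):
    """Threshold function for minimization: favors small performance values."""
    if x >= gamma + eps:
        return 0.0
    if x <= gamma:
        return 1.0
    return (gamma + eps - x) / eps  # linear ramp between gamma and gamma + eps


def sample_quantile_min(performances, rho):
    """ceil((1 - rho) * N)-th order statistic after sorting largest to smallest."""
    ordered = sorted(performances, reverse=True)
    idx = math.ceil((1.0 - rho) * len(ordered))  # 1-based position
    idx = min(max(idx, 1), len(ordered))         # guard against edge cases
    return ordered[idx - 1]
```

The ramp keeps the threshold function continuous in `x`, which is what allows small estimation errors in the sample performances to translate into small changes of the weights.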

2. Similar to Chapter 5, a smoothed parameter updating procedure (cf. e.g., [26],
[66]) is also used in actual implementation of the algorithm, i.e., first a smoothed
parameter vector $\tilde\theta_{k+1}$ is computed at each iteration $k$ according to

\[ \tilde\theta_{k+1} := \upsilon\, \theta_{k+1} + (1 - \upsilon)\, \tilde\theta_k, \quad \forall\, k = 0, 1, \ldots, \quad \text{and} \quad \tilde\theta_0 := \theta_0, \]

where $\theta_{k+1}$ is the parameter vector derived at step 3 of SMRAS, and $\upsilon \in (0, 1]$ is
the smoothing parameter, then $f(x, \tilde\theta_{k+1})$ (instead of $f(x, \theta_{k+1})$) is used in step 1 to
generate new samples.
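
The smoothed update is a convex combination of the newly computed parameter vector and the previous smoothed one. A minimal Python sketch, with hypothetical names and the parameter vector represented as a flat list of floats:

```python
def smooth_parameters(theta_new, theta_smoothed_prev, v):
    """Return v * theta_new + (1 - v) * theta_smoothed_prev, componentwise.

    theta_new           : parameters produced by the current iteration
    theta_smoothed_prev : smoothed parameters used for sampling so far
    v                   : smoothing parameter in (0, 1]; v = 1 disables smoothing
    """
    if not 0.0 < v <= 1.0:
        raise ValueError("v must lie in (0, 1]")
    if len(theta_new) != len(theta_smoothed_prev):
        raise ValueError("parameter vectors must have the same length")
    return [v * a + (1.0 - v) * b for a, b in zip(theta_new, theta_smoothed_prev)]
```

Smaller values of `v` make the sampling distribution move more slowly, which in practice reduces the risk of collapsing onto a poor region early in the search.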

6.5.1  Continuous  Optimization 

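For continuous problems, we use multivariate normal p.d.f.'s as the parameterized
probabilistic model. Initially, a mean vector $\mu_0$ and a covariance matrix $\Sigma_0$ are specified;
then at each iteration of the algorithm, it is easy to see that the new parameters $\mu_{k+1}$ and
$\Sigma_{k+1}$ are updated according to the following recursive formula:

\[ \mu_{k+1} = \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})\, X_i^k}{\frac{1}{N_k}\sum_{i=1}^{N_k} S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})}, \]

and

\[ \Sigma_{k+1} = \frac{\frac{1}{N_k}\sum_{i=1}^{N_k} S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})\, (X_i^k - \mu_{k+1})(X_i^k - \mu_{k+1})^T}{\frac{1}{N_k}\sum_{i=1}^{N_k} S_k(\bar H_k(X_i^k))\, I(\bar H_k(X_i^k),\bar\gamma_{k+1})}. \]

By Corollary 6.4.2, the sequence of mean vectors $\{\mu_k\}$ will converge to the optimal solution
$x^*$ and the sequence of covariance matrices $\{\Sigma_k\}$ to the zero matrix. In subsequent
numerical experiments, $\mu_{k+1}$ will be used to represent the best sample solution found at
iteration $k$.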
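
Both updates are weighted sample moments: each sample receives a nonnegative weight, and the new mean and covariance are the weighted average and the weighted scatter around that average. The Python sketch below shows this computation for generic weights; how the weights are formed from the performance estimates is left to the caller, and all names are hypothetical.

```python
def update_normal_parameters(samples, weights):
    """Weighted mean vector and covariance matrix of the samples.

    samples : list of points, each a list of n floats
    weights : list of nonnegative floats, one per sample, not all zero
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    n = len(samples[0])
    mean = [sum(w * x[d] for w, x in zip(weights, samples)) / total
            for d in range(n)]
    # Weighted scatter around the new mean, normalized by the same total weight.
    cov = [[sum(w * (x[r] - mean[r]) * (x[c] - mean[c])
                for w, x in zip(weights, samples)) / total
            for c in range(n)]
           for r in range(n)]
    return mean, cov
```

Samples with zero weight (those rejected by the threshold function) simply drop out of both sums.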
• Global Convergence


To demonstrate the global convergence of the proposed method, we consider the
following two multi-extremal test functions

(1) Goldstein-Price function with additive noise

\[
\begin{aligned}
H_1(x,\psi) = {} & \big(1 + (x_1 + x_2 + 1)^2 (19 - 14x_1 + 3x_1^2 - 14x_2 + 6x_1x_2 + 3x_2^2)\big) \\
& \times \big(30 + (2x_1 - 3x_2)^2 (18 - 32x_1 + 12x_1^2 + 48x_2 - 36x_1x_2 + 27x_2^2)\big) + \psi,
\end{aligned}
\]

where $x = (x_1, x_2)^T$, and $\psi$ is normally distributed with mean 0 and variance 100.
The function $h_1(x) = E_\psi[H_1(x,\psi)]$ has four local minima and a global minimum
$h_1(0,-1) = 3$.

(2) A 5-dimensional Rosenbrock function with additive noise

\[ H_2(x,\psi) = \sum_{i=1}^{4} \big[100(x_{i+1} - x_i^2)^2 + (x_i - 1)^2\big] + 1 + \psi, \]

where $x = (x_1, \ldots, x_5)^T$, and $\psi$ is normally distributed with mean 0 and variance
100. Its deterministic counterpart $h_2(x) = E_\psi[H_2(x,\psi)]$ has the reputation of being
difficult to minimize and is widely used to test the performance of different global
optimization algorithms. The function has a global minimum $h_2(1,1,1,1,1) = 1$.
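
For readers who want to reproduce the test setup, the two deterministic functions and the additive Gaussian noise (variance 100, i.e. standard deviation 10) can be coded as follows. This is a plain sketch using Python's standard library; the function names are hypothetical.

```python
import random


def goldstein_price(x1, x2):
    """Deterministic Goldstein-Price function h_1."""
    a = 1 + (x1 + x2 + 1) ** 2 * (19 - 14 * x1 + 3 * x1 ** 2
                                  - 14 * x2 + 6 * x1 * x2 + 3 * x2 ** 2)
    b = 30 + (2 * x1 - 3 * x2) ** 2 * (18 - 32 * x1 + 12 * x1 ** 2
                                       + 48 * x2 - 36 * x1 * x2 + 27 * x2 ** 2)
    return a * b


def rosenbrock_shifted(x):
    """Deterministic Rosenbrock function plus 1 (h_2 for len(x) == 5)."""
    return sum(100 * (x[i + 1] - x[i] ** 2) ** 2 + (x[i] - 1) ** 2
               for i in range(len(x) - 1)) + 1


def noisy(value, rng, std=10.0):
    """One noisy observation: value plus zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, std)
```

A single observation of $H_1$ at a point is then `noisy(goldstein_price(x1, x2), rng)` with `rng = random.Random(seed)`; averaging several such observations gives the sample performance at that point.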

For both problems, the same set of parameters are used to test SMRAS: $\beta = 1.02$, $\varepsilon = 0.1$,
mixing coefficient $\forall\, k$, initial sample size $N_0 = 100$, $\rho_0 = 0.9$, $\alpha = 1.03$, and
the observation allocation rule is $M_k = 1.1^k$, the stopping control parameters $\tau = 0.005$
and $l = 10$, the smoothing parameter $\upsilon = 0.2$, the initial mean vector $\mu_0$ is taken to be an
n-by-1 vector of all 10's and $\Sigma_0$ is initialized as an n-by-n diagonal matrix with all diagonal
elements equal to 100.

For each function, we performed 50 independent simulation runs of SMRAS. The
averaged performance of the algorithm is shown in Table 6.1, where $N_{avg}$ is the average
total number of function evaluations needed to satisfy the stopping criteria, $H^*$ and $H_*$
are the worst and best function values obtained in 50 trials, and $\bar H$ is the averaged function
value over the 50 replications. In Figure 6.2, we also plotted the average function values
of the current best sample solutions for (a) function $H_1$ after 45 iterations of SMRAS, (b)
function $H_2$ after 100 iterations of SMRAS.

        N_avg (std err)        H^*     H_*     \bar H (std err)
H_1     5.40e+04 (3.88e+02)    3.05    3.00    3.01 (1.64e-3)
H_2     1.00e+07 (4.92e+05)    1.31    1.02    1.09 (9.10e-3)

Table 6.1: Performance of SMRAS on two test functions, based on 50 independent simulation
runs. The standard errors are in parentheses.

Figure 6.2: Performance of SMRAS on (a) Goldstein-Price function; (b) 5-D Rosenbrock
function.

•  An  Inventory  Control  Example 

To further illustrate the algorithm, we consider an $(s, S)$ inventory control problem
with i.i.d. exponentially distributed continuous demands, zero order lead times, full backlogging
of orders, and linear ordering, holding and shortage costs. The inventory level is
periodically reviewed, and an order is placed when the inventory position (on hand plus
that on order) falls below the level $s$, and the amount of the order is the difference between
$S$ and the current inventory position. Formally, we let $D_t$ denote the demand in period
$t$, $X_t$ the inventory position in period $t$, $p$ the per period per unit demand lost penalty
cost, $h$ the per period per unit inventory holding cost, $c$ the per unit ordering cost, and $K$
the set-up cost per order. The inventory position $\{X_t\}$ evolves according to the following
dynamics

\[ X_{t+1} = \begin{cases} S - D_{t+1} & X_t < s, \\ X_t - D_{t+1} & X_t \ge s. \end{cases} \]

The goal is to choose the thresholds $s$ and $S$ such that the long-run average cost per period
is minimized, i.e.,

\[ (s^*, S^*) = \arg\min J(s,S) := \arg\min \lim_{t\to\infty} J_t(s,S), \]

where $J_t(s,S) := \frac{1}{t}\sum_{i=1}^{t} \big[ I\{X_i < s\}(K + c(S - X_i)) + h X_i^+ + p X_i^- \big]$, $I\{\cdot\}$ is the indicator
function, $x^+ = \max(0,x)$, and $x^- = \max(0,-x)$. Note that the above objective cost
function is convex; however, we will not exploit this property in our method. The primary
reason we choose this problem as our test example is that its analytical optimal
solution can be easily calculated (cf. e.g., [48]).
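
The dynamics and the per-period cost can be simulated directly. The following Python sketch evaluates the average cost of a given $(s, S)$ pair along one demand path; it assumes the system starts at inventory position $S$ and omits any warm-up period, and the function names are hypothetical.

```python
import random


def average_cost(s, S, demands, p, K, c=1.0, h=1.0, x0=None):
    """Average cost per period of an (s, S) policy along a given demand path.

    demands : list of demands D_1, ..., D_t
    p, K    : shortage penalty and set-up cost
    c, h    : per unit ordering and holding costs
    x0      : initial inventory position (assumed to be S when omitted)
    """
    x = S if x0 is None else x0
    total = 0.0
    for d in demands:
        # Order up to S if the previous position was below s, then serve demand.
        x = (S - d) if x < s else (x - d)
        if x < s:
            total += K + c * (S - x)           # set-up plus linear ordering cost
        total += h * max(0.0, x) + p * max(0.0, -x)
    return total / len(demands)


def exponential_demands(mean, periods, seed=0):
    """Draw i.i.d. exponential demands with the given mean."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean) for _ in range(periods)]
```

A noisy performance estimate for a candidate pair is then `average_cost(s, S, exponential_demands(mean, periods, seed), p, K)`; discarding an initial warm-up segment of the path, as done in the experiments, is a straightforward extension.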

The following eight test cases, taken from [31], are used to test the performance of
SMRAS. The cost coefficients and the optimal solutions are given in Table 6.2, each with
$c = h = 1$ and exponentially distributed demands with mean $E[D]$.

Case    E[D]    p      K        J*        s*       S*
1       200     10     100      740.9     341      541
2       200     10     10000    2200.0    0        2000
3       200     100    100      1184.4    784      984
4       200     100    10000    2643.4    443      2443
5       5000    10     100      17078     11078    12078
6       5000    10     10000    21496     6496     16496
7       5000    100    100      28164     22164    23164
8       5000    100    10000    32583     17582    27582

Table 6.2: The eight test cases.

In our simulation experiments, the initial mean vector is taken to be $(2000, 4000)^T$
for all eight cases, and the covariance matrices are initialized as diagonal matrices with
all diagonal elements equal to 10^ for cases 1-4 and 10® for cases 5-8. The other
parameters are: $\beta = 1.05$, $\varepsilon = 0.1$, $\forall\, k$, $N_0 = 100$, $\rho_0 = 0.95$, $\alpha = 1.05$,
$M_k = 1.2^k$, smoothing parameter $\upsilon = 0.3$. The average cost per period is estimated by
averaging the accumulated cost over 50 periods after a warm-up length of 50 periods.

Figure  6.3  shows  the  typical  performance  of  SMRAS  for  the  first  four  test  cases  when 
the  total  number  of  simulation  periods  is  set  to  10®.  The  locations  of  the  optimal  solutions 
are  marked  by  ★.  We  see  that  the  algorithm  converges  rapidly  to  the  neighborhood  of 
the  optimal  solution  in  the  first  few  iterations  and  then  spends  most  of  the  computational 
effort  in  that  small  region.  Numerical  results  for  all  eight  test  cases  are  given  in  Table  6.3. 
In  the  table,  Np  indicates  the  total  number  of  periods  (including  the  warm-up  periods) 
simulated,  and  the  entries  represent  the  averaged  function  values  J  of  the  final  sample 
solutions  obtained  for  different  choices  of  Np,  each  one  based  on  25  independent  simulation 
replications. 


Case    N_p = 10®          N_p = 10®          N_p = hx 10®       N_p = W            J*
1       1169.7 (43.5)      742.6 (0.32)       741.6 (0.14)       741.2 (0.06)       740.9
2       2371.6 (37.8)      2223.9 (3.57)      2202.0 (0.20)      2200.8 (0.17)      2200.0
3       1413.1 (28.0)      1213.8 (5.90)      1188.8 (0.78)      1185.8 (0.28)      1184.4
4       2709.0 (13.4)      2667.2 (4.89)      2647.2 (0.61)      2645.0 (0.42)      2643.4
5       18694.6 (195.5)    17390.4 (48.5)     17245.5 (32.81)    17119.3 (9.25)     17078
6       24001.7 (340.8)    21808.5 (53.6)     21780.0 (34.00)    21520.9 (5.80)     21496
7       32909.1 (579.5)    28778.5 (82.2)     28598.8 (50.25)    28290.1 (33.45)    28164
8       36520.0 (538.0)    32881.7 (216.9)    32860.2 (52.56)    32682.8 (36.68)    32583

Table 6.3: Performance of SMRAS on eight test cases, each one based on 25 independent
simulation runs. The standard errors are in parentheses.

6.5.2  Combinatorial  Optimization 

To illustrate the performance of SMRAS on discrete stochastic optimization problems,
we consider the buffer allocation problem in a service facility with unreliable servers.
The system consists of $m$ servers in series, which are separated by $m - 1$ buffer locations.
Each job enters the system from the first server, goes through all intermediate servers
and buffer locations in a sequential order, and finally exits from the last server. The
service times at each server are independent exponentially distributed with service rate
$\mu_i$, $i = 1, \ldots, m$. The servers are assumed to be unreliable, and are subject to random
failures. When a server fails, it has to be repaired. The time to failure and the time for repair
are both i.i.d. exponentially distributed with respective rates $f_i$ and $r_i$, $i = 1, \ldots, m$.
A server is blocked when the buffer associated with the server coming next to it is full and
is starved when no jobs are offered to it. Thus, the status of a server (busy/broken) will
affect the status of all other servers in the system. Figure 6.4 shows the four-server case,
where server $S_2$ fails, which causes server $S_1$ to become blocked and server $S_3$ to become
starved. We assume that the failure rate of each server remains the same, regardless of
its current status. Given $n$ limited buffer spaces, our goal is to find an optimal way of
allocating these $n$ spaces to the $m - 1$ buffer locations such that the throughput (average
production rate) is maximized.

Figure 6.3: Typical performance of SMRAS on the first four test cases ($N_p$ = 10®).

Figure 6.4: Graphical illustration of the buffer allocation problem.
When applying SMRAS, we have used the same technique as in [3] to generate
admissible buffer allocations; the basic idea is to choose the probabilistic model as an
$(n + 1)$-by-$(m - 1)$ matrix $P$, whose $(i,j)$th entry specifies the probability of allocating
$i - 1$ buffer spaces to the $j$th buffer location. Please refer to their paper for a detailed
discussion. Once the admissible allocations are generated, it is straightforward to see that
the entries of the matrix $P$ are updated at the $k$th iteration as

^,+1  ^  ))7(gfc(Af ),  =  j} 

where $X_l^k$, $l = 1, \ldots, N_k$ are the $N_k$ admissible buffer allocations generated, $\bar H_k(X_l^k)$ is the
average throughput obtained via simulation when the allocation $X_l^k$ is used, and $X_{l,i}^k = j$
indicates the event that $j$ buffer spaces are allocated to the $i$th buffer location (i.e., the
$i$th element of the vector $X_l^k$ is equal to $j$).
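
An update of this type amounts to a weighted relative frequency: each entry of $P$ is the share of the total sample weight carried by the allocations that put a given number of spaces into a given location. The Python sketch below illustrates the idea under two assumptions of ours: rows are indexed from zero by the number of spaces (so row $r$ corresponds to $r$ spaces), and the weights are supplied by the caller rather than computed from the performance estimates. All names are hypothetical.

```python
def update_allocation_matrix(allocations, weights, n):
    """Weighted-frequency update of an (n + 1)-by-(m - 1) probability matrix.

    allocations : list of tuples; entry `loc` of a tuple is the number of
                  buffer spaces (0..n) given to buffer location `loc`
    weights     : nonnegative weight of each allocation, not all zero
    n           : total number of buffer spaces
    """
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("at least one weight must be positive")
    locations = len(allocations[0])
    P = [[0.0] * locations for _ in range(n + 1)]
    for alloc, w in zip(allocations, weights):
        for loc, spaces in enumerate(alloc):
            P[spaces][loc] += w / total  # share of weight with `spaces` at `loc`
    return P
```

Each column of the returned matrix sums to one, so it can be used directly as a set of sampling distributions, one per buffer location.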

For the numerical experiments, we consider two cases: (i) $m = 3$, $n = 1, \ldots, 10$,
$\mu_1 = 1$, $\mu_2 = 1.2$, $\mu_3 = 1.4$, failure rates $f_i = 0.05$ and repair rates $r_i = 0.5$ for all $i = 1, 2, 3$;
(ii) $m = 5$, $n = 1, \ldots, 10$, $\mu_1 = 1$, $\mu_2 = 1.1$, $\mu_3 = 1.2$, $\mu_4 = 1.3$, $\mu_5 = 1.5$, $f_i = 0.05$ and
$r_i = 0.5$ for all $i = 1, \ldots, 5$.

Apart from their combinatorial nature, an additional difficulty in solving these problems
is that different buffer allocation schemes (samples) have similar performances. Thus,
when only noisy observations are available, it could be very difficult to discern the best
allocation from a set of candidate allocation schemes. Because of this, in SMRAS we
choose the performance function $S(\cdot)$ as an exponential function with a relatively larger
base $\beta = 10$. The other parameters are as follows: $\varepsilon = 0.001$, $\lambda_k = 0.01$ $\forall\, k$, initial sample
size $N_0 = 10$ for case (i) and $N_0 = 20$ for case (ii), $\rho_0 = 0.9$, $\alpha = 1.2$, observation allocation
rule $M_k = (1.5)^k$, the stopping control parameters $\tau = 10^{-4}$ and $l = 5$, smoothing parameter
$\upsilon = 0.7$, and the initial $P^0$ is taken to be a uniform matrix with each column sum
equal to one, i.e., $P^0_{i,j} = 1/(n+1)$ $\forall\, i, j$. We start all simulation replications with the system
empty. The steady-state throughputs are simulated after 100 warm-up events, and then
averaged over the subsequent 900 events. Note that we have employed the sample reuse
procedure (cf. Remark 6.3.1) in actual implementation of the algorithm.


Figure  6.5:  Performance  of  SMRAS  on  the  buffer  allocation  problem  (five-server  n  =  10 
case). 

Tables 6.4 and 6.5 give the performances of SMRAS for each of the respective
cases (i) and (ii). In each table, $N_{avg}$ is the averaged number of simulations over 16
independent trials, Alloc is the best allocation scheme and $N_{A^*}$ is the number of times
the best allocation was found out of 16 runs, $\bar T$ is the averaged throughput value calculated
by the algorithm, and $T^*$ represents the exact optimal solution (cf. [84]). We see that
in both cases, SMRAS produces very accurate solutions while using only a small number
of observations. To illustrate how SMRAS performs on this problem, we consider the
five-server $n = 10$ case, where the total number of admissible allocation rules is 286. For
the 286 solutions, we rank them from the worst to the best in terms of their performance
and then equally partition these solutions into ten groups. For example, in Figure 6.5,
the interval [0,1] represents the entire solution space, the interval [0,0.1] represents the
worst 10% solutions, and [0.9,1] represents the top 10% best solutions. For SMRAS, the
averaged total number of solutions visited is 102. Figure 6.5 shows, among the total
102 visits, the number of times each part of the solution space has been visited, where the
red dashed line represents the 95% confidence interval. We see that the
top 10% solutions have been visited significantly more often than solutions in other parts
of the solution space. Also note that during the search of the algorithm, some solutions
may be visited multiple times; the number of distinct solutions
visited is only 47, a small fraction of the solution space.

n     N_avg (std err)    Alloc (N_A*)    \bar T (std err)    T*
1     33.1 (0.49)        [1,0] (16)      0.634 (4.06e-4)     0.634
2     46.8 (3.15)        [1,1] (16)      0.674 (6.35e-4)     0.674
3     43.9 (1.51)        [2,1] (16)      0.711 (6.11e-4)     0.711
4     49.8 (3.45)        [3,1] (14)      0.735 (6.47e-4)     0.736
5     50.4 (3.68)        [3,2] (13)      0.758 (1.06e-3)     0.759
6     64.0 (6.29)        [4,2] (12)      0.776 (1.39e-3)     0.778
7     59.1 (4.27)        [5,2] (14)      0.792 (1.04e-3)     0.792
8     63.9 (4.79)        [5,3] (10)      0.805 (1.20e-3)     0.806
9     60.6 (3.46)        [6,3] (10)      0.817 (6.53e-4)     0.818
10    63.7 (5.69)        [7,3] (12)      0.826 (9.88e-4)     0.827

Table 6.4: Performance of SMRAS on the buffer allocation problems case (i), based on 16
independent simulation runs. The standard errors are in parentheses.
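
The count of 286 admissible allocations for the five-server, $n = 10$ case is the number of ways to split 10 spaces among 4 buffer locations when every space is used. The short Python sketch below verifies this both by the closed-form binomial count and by explicit enumeration; the function names are hypothetical.

```python
import math


def count_allocations(n, locations):
    """Number of ways to distribute exactly n spaces over the given locations."""
    return math.comb(n + locations - 1, locations - 1)


def enumerate_allocations(n, locations):
    """List every allocation as a tuple of nonnegative integers summing to n."""
    if locations == 1:
        return [(n,)]
    return [(first,) + rest
            for first in range(n + 1)
            for rest in enumerate_allocations(n - first, locations - 1)]
```

With only 286 candidates the enumeration is cheap, which is what makes it possible to rank all solutions and report how often each decile of the solution space is visited.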


n     N_avg (std err)    Alloc (N_A*)      \bar T (std err)    T*
1     1.02e+2 (7.49)     [0,1,0,0] (16)    0.523 (6.79e-4)     0.521
2     1.29e+2 (14.8)     [1,1,0,0] (16)    0.555 (3.86e-4)     0.551
3     1.75e+2 (15.7)     [1,1,1,0] (16)    0.587 (4.57e-4)     0.582
4     2.51e+2 (25.9)     [1,2,1,0] (11)    0.606 (1.20e-3)     0.603
5     3.37e+2 (42.0)     [2,2,1,0] (10)    0.626 (6.57e-4)     0.621
6     4.69e+2 (55.2)     [2,2,1,1] (8)     0.644 (1.10e-3)     0.642
7     4.56e+2 (58.2)     [2,2,2,1] (7)     0.659 (1.10e-3)     0.659
8     4.45e+2 (54.9)     [3,2,2,1] (7)     0.674 (1.10e-3)     0.674
9     5.91e+2 (56.1)     [3,3,2,1] (6)     0.689 (1.39e-3)     0.689
10    5.29e+2 (54.0)     [3,3,3,1] (8)     0.701 (1.10e-3)     0.701

Table 6.5: Performance of SMRAS on the buffer allocation problem case (ii), based on 16
independent simulation runs. The standard errors are in parentheses.

6.6  Conclusions 

We have proposed a new randomized search method, called Stochastic Model Reference
Adaptive Search (SMRAS), for solving both continuous and discrete stochastic
global optimization problems. The method is shown to converge asymptotically to the
optimal solution with probability one. The algorithm is general and requires only a few mild
regularity conditions on the underlying problem, and thus can be applied to a wide range
of problems with little modification. More importantly, we believe that the idea behind
SMRAS offers a general framework for stochastic global optimization, based on which one
can possibly design and implement other efficient algorithms.

There are several input parameters in SMRAS. In our preliminary numerical experiments,
the choices of these parameters are based on trial and error. For a given problem,
how to determine a priori the most appropriate values of these parameters is an open issue.
One research topic is to study the effects of these parameters on the performance of the
method, and possibly to design a scheme that chooses them adaptively
during the search process.

Our current numerical study with the algorithm shows that the objective function
need not be evaluated very accurately during the initial search phase. Instead, it is
sufficient to provide the algorithm with a rough idea of where the good solutions are located.
This has motivated us to consider observation allocation rules whose rate of increase
adapts to the search phase. For instance, during the initial search phase, we
could increase the number of observations at a linear rate or even keep it at a constant
value; exponential rates would be used only during the later search phase, when more
accurate estimates of the objective function values are required.
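
One way such a phase-dependent allocation rule could look is sketched below. It is an illustration only: the function name, the switch point switch_iter, and the constants m0, linear_step, and growth are all hypothetical and are not values used in our experiments.

```python
import math

def observations_per_solution(k, switch_iter=20, m0=10, linear_step=2, growth=1.1):
    """Hypothetical two-phase schedule for the number of observations at iteration k."""
    if k < switch_iter:
        # Initial phase: rough estimates suffice, so grow linearly
        # (linear_step=0 keeps the number of observations constant).
        return m0 + linear_step * k
    # Later phase: grow exponentially from the value reached at the switch point.
    m_switch = m0 + linear_step * switch_iter
    return math.ceil(m_switch * growth ** (k - switch_iter))
```

The schedule is continuous at the switch point and non-decreasing for linear_step >= 0 and growth >= 1. How to choose the switch point is itself part of the open question raised above.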

Some other research topics that would further enhance the performance of SMRAS
include incorporating local search techniques into the algorithm and implementing a
parallel version of the method.
Chapter 7

Conclusions  and  Future  Research 

This dissertation consists of two main parts. The first part focuses on the development
of new computational methodologies for solving Markov Decision Processes, where
we have proposed two algorithms. The first algorithm is motivated by the computational
challenges arising from settings where some of the parameters of the MDP models are
either unknown or cannot be obtained in a feasible way. In particular, we have assumed
that the underlying system can be simulated, and proposed to use multi-armed bandit
models as efficient tools to adaptively allocate simulation samples to find good policies
and/or value function estimates. We have shown the asymptotic unbiasedness of our
approach, developed a convergence rate result, and studied its computational complexity.
The second algorithm complements existing state space reduction techniques, and
addresses the solution of MDPs with large or uncountable action spaces. We have used
an evolutionary population-based approach, which combines specialized MDP solution
techniques with ideas from evolutionary algorithms for optimization, to avoid carrying out
an optimization over the entire action space. The convergence of the resultant algorithm
is proved, and its computational complexity is discussed. We have also compared the
performance of our algorithm with that of other solution methods, including the classical policy
iteration method and a recently proposed algorithm called evolutionary policy iteration.
Numerical results demonstrate the promise of the proposed algorithm.

In the second part of this thesis, we have proposed a new randomized search
(simulation-based) framework for solving general global optimization problems with little
structure. The framework successfully addresses two of the most commonly encountered
difficulties for many model-based search techniques, i.e., the problem of how to generate
random samples and the problem of how to efficiently update probabilistic models. We
argue that our framework can be easily used to construct a class of randomized global
optimization algorithms with theoretical performance guarantees. Moreover, within this
framework, the convergence analysis and practical performance of different algorithm
instantiations depend heavily on a sequence of independently constructed models called
reference models. Thus, when constructing different instantiations, we can concentrate
our effort on the design of these reference models. We have provided a particular
instantiation of the framework, analyzed its convergence properties, and carried out detailed
numerical experiments to compare its performance with that of some other well-known
methods such as the Cross-Entropy method and simulated annealing. Both theoretical and
empirical results demonstrate the potential of the proposed approach. In the final part
of this thesis, we have rigorously discussed how to extend this framework to stochastic
global optimization problems. Again, our discussion has mostly centered on a
particular algorithm instantiation, but we note that our work can be easily carried over
to various other instantiations.

7.1  Future  Work 

This  research  has  initiated  some  new  and  promising  ideas  in  the  field  of  decision 
making  under  uncertainty.  However,  there  are  still  many  refinements  that  can  be  explored. 
Some  possible  future  research  topics  are  outlined  as  follows. 

In Chapter 3, we have proposed to use the multi-armed bandit model of [8] to
adaptively choose which action to sample at each decision epoch, so that the resulting
algorithm achieves logarithmic regret uniformly over time. However, this particular
sampling strategy only gives us the asymptotic unbiasedness of the algorithm, a much weaker
result than (almost sure) convergence. In this respect, it could be more useful to view
the adaptive multi-stage sampling (AMS) method as a simulation-based framework for solving
finite-horizon MDPs, and to look for different bandit models or even other sampling
techniques, so that stronger (almost sure) convergence of the
resultant algorithms can be shown. Along this line, one possibility is to use the model reference adaptive
search (MRAS) proposed in Chapter 5 as a potential sampling technique, and combine it
with the AMS framework to yield yet another adaptive sampling algorithm. An additional
advantage of using MRAS is that the finite-action-space assumption in the original AMS
algorithm can be relaxed; the action space can be infinite or even uncountable.
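
To make the sampling idea concrete, the following minimal sketch allocates a fixed simulation budget among finitely many actions at a single decision epoch using a UCB1-style index. It is an illustration under stated assumptions, not the algorithm of Chapter 3: the index form is assumed rather than taken from [8], rewards are assumed to lie in [0, 1], the multi-stage recursion is omitted, and the names ucb_sample and sample_reward are hypothetical.

```python
import math
import random

def ucb_sample(actions, sample_reward, total_samples, seed=0):
    """Allocate total_samples among actions with a UCB1-style index (rewards in [0, 1])."""
    rng = random.Random(seed)
    counts = {a: 0 for a in actions}
    means = {a: 0.0 for a in actions}
    for t in range(1, total_samples + 1):
        untried = [a for a in actions if counts[a] == 0]
        if untried:
            # Sample every action once before the index is defined.
            a = untried[0]
        else:
            # Index = sample mean + exploration bonus that shrinks with the count.
            a = max(actions,
                    key=lambda b: means[b] + math.sqrt(2 * math.log(t) / counts[b]))
        r = sample_reward(a, rng)  # one simulated reward for action a
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running average
    return counts, means
```

The loop over the finite action list is exactly where the finite-action-space assumption enters; replacing this selection step with a sampling distribution over actions is what would allow infinite or uncountable action spaces.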

When constructing sub-MDPs in ERPS, the action selection distribution V is currently
held fixed throughout the entire search process. As discussed in Section 4.7, one
possible and important line of research is to update the underlying action selection
distribution based on past sampling information, so that more promising actions will have
larger probabilities of being sampled in the future. Again, we believe that MRAS could
serve as a promising candidate for updating these distributions. Thus, by combining
MRAS with the so-called PICS step, it is possible to construct a new algorithm with
balanced explorative and exploitative search that could be even more efficient in practice.
Moreover, as mentioned in Section 4.7, there is no need to carry out an explicit local
search at each iteration of the algorithm, since the sequence of action selection
distributions will become more and more concentrated on regions containing high quality
solutions (actions).

Regarding MRAS, we believe that there are several interesting future research
directions. The most obvious one is perhaps to explore its potential applications in solving
MDPs. This can be done either directly in the sense of [58], [68], where MDPs are
interpreted as optimization problems over the policy spaces, or indirectly along the lines
discussed in the previous two paragraphs. Another important direction is to study
the convergence rate and the computational complexity of MRAS, perhaps for a class of
problems (e.g., Lipschitz continuous, convex problems) of interest. The work of [74] and
[89] in annealing adaptive search (AAS) (which involves the use of Boltzmann
distributions) sheds some light on this area. Thus, one possibility, in particular, is to investigate
the use of the Boltzmann distributions as the reference distributions in MRAS, and see if
some nice properties (including convergence, rate, and complexity in the context of AAS)
of the Boltzmann distributions are preserved by the method. From a more general point
of view, we can always construct reference models that exploit the structures of the
underlying problems, and thus design algorithms tailored to particular applications. The third
direction is to develop new convergent algorithm instantiations that use only a fixed
(sample) population size, perhaps via the use of past sampling information. This is especially
attractive in the context of stochastic optimization where the simulation/observation cost
is high, since the current version of MRAS requires the population size to increase
in order to guarantee theoretical convergence.
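
As a small illustration of the Boltzmann distributions mentioned above, the sketch below computes them over a finite set of objective values. It assumes a maximization problem and a finite solution set; the function name boltzmann_distribution and the temperature parameter are hypothetical, and nothing here is claimed about how such distributions would be embedded in MRAS.

```python
import math

def boltzmann_distribution(values, temperature):
    """Boltzmann probabilities over a finite set of objective values (maximization)."""
    # Subtracting the maximum leaves the probabilities unchanged and avoids overflow.
    m = max(values)
    weights = [math.exp((v - m) / temperature) for v in values]
    z = sum(weights)  # normalizing constant
    return [w / z for w in weights]
```

As the temperature decreases, the probability mass concentrates on the best solutions, while a high temperature gives a nearly uniform distribution; this concentration behavior is what makes these distributions natural candidates for reference distributions.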